Preface


Over the past decades, time series analysis has experienced a rapid proliferation of
applications  in  economics,  especially  in  macroeconomics  and  finance.  Today  these 
tools  have  become  indispensable  to  any  empirically  working  economist.  Whereas  in 
the  beginning  the  transfer  of  knowledge  essentially  flowed  from  the  natural  sciences, 
especially  statistics  and  engineering,  to  economics,  over  the  years  theoretical  and 
applied  techniques  specifically  designed  for  the  nature  of  economic  time  series 
and  models  have  been  developed.  Thereby,  the  estimation  and  identification  of 
structural  vector  autoregressive  models,  the  analysis  of  integrated  and  cointegrated 
time  series,  and  models  of  volatility  have  been  extremely  fruitful  and  far-reaching 
areas  of  research.  With  the  award  of  the  Nobel  Prizes  to  Clive  W.  J.  Granger  and 
Robert  F.  Engle  III  in  2003  and  to  Thomas  J.  Sargent  and  Christopher  A.  Sims  in 
2011, the field has reached a certain degree of maturity. It therefore seems natural
to assemble the vast amount of material scattered over many papers into a
comprehensive textbook.

The book is self-contained and addresses economics students who already have
some prerequisite knowledge in econometrics. It is thus suited for advanced
bachelor's, master's, or beginning PhD students but also for applied researchers. The
book tries to put them in a position to follow the rapidly growing
research literature and to implement these techniques on their own. Although the
book tries to be rigorous in terms of concepts, definitions, and statements
of theorems, not all proofs are carried out. This is especially true for the more
technical and lengthy proofs for which the reader is referred to the pertinent
literature.

The book covers approximately a two-semester course in time series analysis
and is divided into two parts. The first part treats univariate time series, in particular
autoregressive  moving-average  processes.  Most  of  the  topics  are  standard  and  can 
form  the  basis  for  a  one-semester  introductory  time  series  course.  This  part  also 
contains  a  chapter  on  integrated  processes  and  on  models  of  volatility.  The  latter 
topics  could  be  included  in  a  more  advanced  course.  The  second  part  is  devoted  to 
multivariate  time  series  analysis  and  in  particular  to  vector  autoregressive  processes. 
It  can  be  taught  independently  of  the  first  part.  The  identification,  modeling,  and 
estimation  of  these  processes  form  the  core  of  the  second  part.  A  special  chapter 
treats  the  estimation,  testing,  and  interpretation  of  cointegrated  systems.  The  book 
also contains a chapter with an introduction to state space models and the Kalman
filter. Whereas the book is almost exclusively concerned with linear systems, the
last chapter gives a perspective on some more recent developments in the context
of nonlinear models. I have included exercises and worked-out examples to reinforce
the material. Finally, I have produced five appendices which
summarize  important  topics  such  as  complex  numbers,  linear  difference  equations, 
and  stochastic  convergence. 

As time series analysis has become a rapidly growing field with active
research in many directions, it goes without saying that not all topics received the
attention  they  deserved  and  that  there  are  areas  not  covered  at  all.  This  is  especially 
true  for  the  recent  advances  made  in  nonlinear  time  series  analysis  and  in  the 
application  of  Bayesian  techniques.  These  two  topics  alone  would  justify  an  extra 
book. 

The  data  manipulations  and  computations  have  been  performed  using  the 
software  packages  EVIEWS  and  MATLAB.1  Of  course,  there  are  other  excellent 
packages  available.  The  data  for  the  examples  and  additional  information  can 
be  downloaded  from  my  home  page  www.neusser.ch.  To  maximize  the  learning 
success,  it  is  advised  to  replicate  the  examples  and  to  perform  similar  exercises 
with  alternative  data.  Interesting  macroeconomic  time  series  can,  for  example,  be 
downloaded  from  the  following  home  pages: 

Germany:  www.bundesbank.de 
Switzerland:  www.snb.ch 
United  Kingdom:  www.statistics.gov.uk 
United  States:  research.stlouisfed.org/fred2 

The book grew out of lectures which I had the occasion to give over the years
in Bern and at other universities. I would therefore like to thank the many students,
in particular Philip Letsch, who had to work through the manuscript and who
called my attention to obscurities and typos. I also want to thank my colleagues
and teaching assistants Andreas Bachmann, Gregor Baurle, Fabrice Collard, Sarah
Fischer, Stephan Leist, Senada Nukic, Kurt Schmidheiny, Reto Tanner, and Martin
Wagner for reading the manuscript or part of it and for making many valuable
criticisms and comments. Special thanks go to my former colleague and coauthor
Robert Kunst who meticulously read and commented on the manuscript. It goes
without saying that all errors and shortcomings are my own.

Bern,  Switzerland/Eggenburg,  Austria  Klaus  Neusser 

February  2016 


1 EVIEWS is a product of IHS Global Inc. MATLAB is a matrix-oriented software package developed by
MathWorks which is ideally suited for econometric and time series applications.
Univariate Time Series Analysis


Introduction  and  Basic  Theoretical  Concepts 


Time  series  analysis  is  an  integral  part  of  every  empirical  investigation  which  aims 
at  describing  and  modeling  the  evolution  over  time  of  a  variable  or  a  set  of  variables 
in  a  statistically  coherent  way.  The  economics  of  time  series  analysis  is  thus  very 
much  intermingled  with  macroeconomics  and  finance  which  are  concerned  with  the 
construction  of  dynamic  models.  In  principle,  one  can  approach  the  subject  from 
two  complementary  perspectives.  The  first  one  focuses  on  descriptive  statistics. 
It  characterizes  the  empirical  properties  and  regularities  using  basic  statistical 
concepts  like  mean,  variance,  and  covariance.  These  properties  can  be  directly 
measured  and  estimated  from  the  data  using  standard  statistical  tools.  Thus,  they 
summarize  the  external  (observable)  or  outside  characteristics  of  the  time  series.  The 
second  perspective  tries  to  capture  the  internal  data  generating  mechanism.  This 
mechanism  is  usually  unknown  in  economics  as  the  models  developed  in  economic 
theory  are  mostly  of  a  qualitative  nature  and  are  usually  not  specific  enough  to 
single  out  a  particular  mechanism.1  Thus,  one  has  to  consider  some  larger  class 
of models. By far the most widely used is the class of autoregressive moving-average
(ARMA) models which rely on linear stochastic difference equations with constant
coefficients. Of course, one wants to know how the two perspectives are related,
which leads to the important problem of identifying a model from the data.

The  observed  regularities  summarized  in  the  form  of  descriptive  statistics  or  as  a 
specific  model  are,  of  course,  of  principal  interest  to  economics.  They  can  be  used 
to  test  particular  theories  or  to  uncover  new  features.  One  of  the  main  assumptions 
underlying time series analysis is that the regularities observed in the sample period
are not specific to that period, but can be extrapolated into the future. This leads to
the issue of forecasting which is another major application of time series analysis.

1 One prominent exception is the random-walk hypothesis of real private consumption first derived
and analyzed by Hall (1978). This hypothesis states that the current level of private consumption
should just depend on private consumption one period ago and on no other variable, in particular
not on disposable income. The random-walk property of asset prices is another very much
discussed hypothesis. See Campbell et al. (1997) for a general exposition and Samuelson (1965)
for a first rigorous derivation from market efficiency.

Although  its  roots  lie  in  the  natural  sciences  and  in  engineering,  time  series 
analysis,  since  the  early  contributions  by  Frisch  (1933)  and  Slutzky  (1937),  has 
become  an  indispensable  tool  in  empirical  economics.  Early  applications  mostly 
consisted  in  making  the  knowledge  and  methods  acquired  there  available  to 
economics.  However,  with  the  progression  of  econometrics  as  a  separate  scientific 
field,  more  and  more  techniques  that  are  specific  to  the  characteristics  of  economic 
data have been developed. I just want to mention the analysis of univariate and
multivariate integrated and cointegrated time series (see Chaps. 7 and 16),
the  identification  of  vector  autoregressive  (VAR)  models  (see  Chap.  15),  and  the 
analysis  of  volatility  of  financial  market  data  in  Chap.  8.  Each  of  these  topics 
alone  would  justify  the  treatment  of  time  series  analysis  in  economics  as  a  separate 
subfield. 


1.1  Some  Examples 

Before  going  into  more  formal  analysis,  it  is  useful  to  examine  some  prototypical 
economic  time  series  by  plotting  them  against  time.  This  simple  graphical  inspection 
already  reveals  some  of  the  issues  encountered  in  this  book.  One  of  the  most  popular 
time  series  is  the  real  gross  domestic  product.  Figure  1.1  plots  the  data  for  the 
U.S.  from  1947  first  quarter  to  2011  last  quarter  on  logarithmic  scale.  Several 
observations  are  in  order.  First,  the  data  at  hand  cover  just  a  part  of  the  time  series. 
There are data available before 1947 and there will be data available after 2011. As
there is neither a natural starting point nor an end point, we think of a time series as
extending back into the infinite past and into the infinite future. Second, the observations
are treated as the realizations of a random mechanism. This implies that we observe only one
realization. If we could turn back time and let history run again, we would obtain
a second realization. This is, of course, impossible, at least in the macroeconomic
context. Thus, typically, we are faced with just one realization on which to base our
analysis.  However,  sound  statistical  analysis  needs  many  realizations.  This  implies 
that  we  have  to  make  some  assumption  on  the  constancy  of  the  random  mechanism 
over  time.  This  leads  to  the  concept  of  stationarity  which  will  be  introduced  more 
rigorously  in  the  next  section.  Third,  even  a  cursory  look  at  the  plot  reveals  that 
the  mean  of  real  GDP  is  not  constant,  but  is  upward  trending.  As  we  will  see,  this 
feature  is  typical  of  many  economic  time  series.2  The  investigation  into  the  nature 
of  the  trend  and  the  statistical  consequences  thereof  have  been  the  subject  of  intense 
research over the last couple of decades. Fourth, a simple way to overcome this
problem is to take first differences. As the data have been logged, this amounts to
taking growth rates.3 The corresponding plot is given in Fig. 1.2 which shows no
trend anymore.

2 See footnote 1 for some theories predicting non-stationary behavior.

Fig. 1.1 Real gross domestic product (GDP) of the U.S. (chained 2005 dollars; seasonally
adjusted annual rate)

Fig. 1.2 Quarterly growth rate of U.S. real gross domestic product (GDP) (chained 2005 dollars)
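The statement that first differences of logged data amount to growth rates can be checked numerically. The following is only a minimal sketch: the series `levels`, the growth rate `g`, and the helper `log_growth` are hypothetical and chosen for illustration, not taken from the GDP data used in the figures.

```python
import math

def log_growth(levels):
    """Return the first differences of the logged series."""
    logs = [math.log(x) for x in levels]
    return [b - a for a, b in zip(logs, logs[1:])]

# Hypothetical quarterly levels growing at exactly 0.8% per quarter.
g = 0.008
levels = [100.0 * (1.0 + g) ** t for t in range(5)]

for d in log_growth(levels):
    # Each log difference equals ln(1 + g), which is close to g when g is small.
    print(f"log difference = {d:.6f}, exact growth rate = {g:.6f}")
```

The two numbers agree closely for growth rates of the size typically observed in quarterly macroeconomic data; the approximation deteriorates as the growth rate becomes large.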

Another feature often encountered in economic time series is seasonality. This
issue arises, for example in the case of real GDP, because of a particular regularity
within a year: the first quarter being the quarter with the lowest values, the second
and fourth quarter those with the highest values, and the third quarter being in
between. These movements are due to climatic and holiday seasonal variations
within the year and are viewed to be of minor economic importance. Moreover,
these seasonal variations, because of their size, hide the more important business
cycle movements. It is therefore customary to work with time series which have
been adjusted for seasonality beforehand. Figure 1.3 shows the unadjusted and
the adjusted real gross domestic product for Switzerland. The adjustment has been
achieved by taking a moving-average. This makes the time series much smoother
and evens out the seasonal movements.

3 This is obtained by using the approximation $\ln(1 + \varepsilon) \approx \varepsilon$ for small $\varepsilon$ where $\varepsilon$ equals the growth
rate of GDP.

Fig. 1.3 Comparison of unadjusted and seasonally adjusted Swiss real gross domestic product
(GDP)
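How a moving average evens out a seasonal pattern can be illustrated with a small sketch. The text does not say which moving average was used for Fig. 1.3; the code below assumes a centered 2x4 moving average, a common choice for quarterly data, and applies it to made-up numbers. The function name `centered_ma_quarterly` and the series are hypothetical.

```python
def centered_ma_quarterly(x):
    """Centered 2x4 moving average; None where the window does not fit."""
    weights = [0.125, 0.25, 0.25, 0.25, 0.125]
    out = [None] * len(x)
    for t in range(2, len(x) - 2):
        window = x[t - 2 : t + 3]
        out[t] = sum(w * v for w, v in zip(weights, window))
    return out

# Hypothetical data: a linear trend plus a fixed quarterly pattern that sums
# to zero (first quarter lowest, second and fourth highest, third in between).
seasonal = [-3.0, 2.0, -1.0, 2.0]
trend = [100.0 + 0.5 * t for t in range(16)]
raw = [trend[t] + seasonal[t % 4] for t in range(16)]

smooth = centered_ma_quarterly(raw)
print(smooth[2:6])
```

Because the weights are symmetric and give each quarter a total weight of one fourth, a stable seasonal pattern cancels and a linear trend passes through unchanged. Seasonal adjustment procedures used by statistical offices are considerably more elaborate.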

Other typical economic time series are the short- and long-term interest rates plotted
in Fig. 1.4. Over the period considered these two variables also seem to trend. However,
the nature of this trend must be different because of the theoretically binding zero lower
bound. Although the relative level of the two series changes over time (at the beginning of
the sample, short-term rates are higher than long-term ones), they move more or
less together. This comovement holds in particular over the medium and long term.

Other prominent time series are stock market indices. In Fig. 1.5 the Swiss
Market Index (SMI) is plotted as an example. The first panel displays the raw
data on a logarithmic scale. One can clearly discern the different crises: the internet
bubble in 2001 and the most recent financial market crisis in 2008. More interesting
than the index itself is the return on the index plotted in the second panel. Whereas
the mean seems to stay relatively constant over time, the volatility does not: in
periods of crisis volatility is much higher. This clustering of volatility is a typical
feature of financial market data and will be analyzed in detail in Chap. 8.
Finally, Fig. 1.6 plots the unemployment rate for Switzerland. This is another
widely discussed time series. However, the Swiss data have a particular feature
in that the behavior of the series changes over time. Whereas unemployment was
practically nonexistent in Switzerland up to the end of the 1990s, several policy
changes (introduction of unemployment insurance, liberalization of immigration
laws) led to drastic shifts. Although such dramatic structural breaks are rare, one
always has to be aware of such a possibility. Breaks can result from policy changes
or simply from structural changes in the economy at large.4


1.2  Formal  Definitions 

The previous section attempted to give an intuitive approach to the subject. The
analysis to follow necessitates, however, more precise definitions and concepts.
At the heart of the exposition stands the concept of a stochastic process. For this
purpose we view the observation at some time $t$ as the realization of a random
variable $X_t$. In time series analysis we are, however, in general not interested in a
particular point in time, but rather in a whole sequence. This leads to the following
definition.

Definition 1.1. A stochastic process $\{X_t\}$ is a family of random variables indexed
by $t \in T$ and defined on some given probability space.


4 Burren and Neusser (2013) investigate, for example, how systematic sectoral shifts affect volatility
of real GDP growth.
Fig. 1.5 Swiss Market Index (SMI), (a) Index, (b) Daily return


Thereby  T  denotes  an  ordered  index  set  which  is  typically  identified  with  time. 
In  the  literature  one  can  encounter  the  following  index  sets: 

discrete time: $T = \{1, 2, \ldots\} = \mathbb{N}$

discrete time: $T = \{\ldots, -2, -1, 0, 1, 2, \ldots\} = \mathbb{Z}$

continuous time: $T = [0, \infty) = \mathbb{R}_+$ or $T = (-\infty, \infty) = \mathbb{R}$
Fig. 1.6 Unemployment rate in Switzerland


Remark  1.1.  Given  that  T  is  identified  with  time  and  thus  has  a  direction,  a 
characteristic  of  time  series  analysis  is  the  distinction  between  past,  present,  and 
future. 

For technical reasons which will become clear later, we will work with $T = \mathbb{Z}$,
the set of integers. This choice is consistent with the use of time indices in
economics as there is, usually, no natural starting point nor a foreseeable endpoint.
Although models in continuous time are well established in the theoretical finance
literature, we will disregard them because observations are always of a discrete
nature and because models in continuous time would impose substantially higher
mathematical requirements.

Remark 1.2. The random variables $\{X_t\}$ take values in a so-called state space. In
the first part of this treatise, we take as the state space the space of real numbers $\mathbb{R}$
and thus consider only univariate time series. In part II we extend the state space to
$\mathbb{R}^n$ and study multivariate time series. Theoretically, it is possible to consider other
state spaces (for example, $\{0, 1\}$, the integers, or the complex numbers), but this will
not be pursued here.

Definition 1.2. The function $t \mapsto x_t$ which assigns to each point in time $t$ the
realization of the random variable $X_t$, $x_t$, is called a realization or a trajectory of
the stochastic process. We denote such a realization by $\{x_t\}$.
We use the term time series both for the realization or trajectory (observations or
data) and for the underlying stochastic process. Usually, there is no room for
misunderstandings. A trajectory therefore represents one observation of the stochastic
process. Whereas in standard statistics a sample consists of several, typically
independent, draws from the same distribution, a sample in time series analysis
is just one trajectory. Thus, we are confronted with a situation where there is in
principle just one observation. We cannot turn back the clock and get additional
trajectories. The situation is even worse as we typically observe only the realizations
in a particular time window. For example, we might have data on US GDP from
the first quarter of 1960 up to the last quarter of 2011. But clearly the United
States existed before 1960 and will continue to exist after 2011, so that there are
in principle observations before 1960 and after 2011. In order to make a meaningful
statistical analysis, it is therefore necessary to assume that the observed part of the
trajectory is typical for the time series as a whole. This idea is related to the concept
of stationarity which we will introduce more formally below. In addition, we want
to require that the observations cover in principle all possible events. This leads to
the concept of ergodicity. We avoid a formal definition of ergodicity as this would
require a sizeable amount of theoretical probabilistic background material which
goes beyond the scope of this treatise.5

An important goal of time series analysis is to build a model given the realization
(data) at hand. This amounts to specifying the joint distribution of some set of $X_t$'s with
corresponding realization $\{x_t\}$.

Definition 1.3 (Model). A time series model or a model for the observations (data)
$\{x_t\}$ is a specification of the joint distribution of $\{X_t\}$ for which $\{x_t\}$ is a realization.

The  Kolmogorov  existence  theorem  ensures  that  the  specification  of  all  finite 
dimensional  distributions  is  sufficient  to  characterize  the  whole  stochastic  process 
(see  Billingsley  (1986),  Brockwell  and  Davis  (1991),  or  Kallenberg  (2002)). 

Most of the time it is too involved to specify the complete distribution so that
one relies on only the first two moments. These moments are then given by the
means $\mathbb{E}X_t$, the variances $\mathbb{V}X_t$, $t \in \mathbb{Z}$, and the covariances
$\operatorname{cov}(X_t, X_s) = \mathbb{E}(X_t - \mathbb{E}X_t)(X_s - \mathbb{E}X_s) = \mathbb{E}(X_t X_s) - \mathbb{E}X_t\,\mathbb{E}X_s$,
respectively the correlations
$\operatorname{corr}(X_t, X_s) = \operatorname{cov}(X_t, X_s) / (\sqrt{\mathbb{V}X_t}\,\sqrt{\mathbb{V}X_s})$, $t, s \in \mathbb{Z}$.
If the random variables are jointly normally
distributed then the specification of the first two moments is sufficient to characterize
the whole distribution.
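For readers who want to see why the two expressions for the covariance coincide, the step can be written out as follows. It uses only the linearity of the expectation and the fact that the means are constants:

$$\mathbb{E}\big[(X_t - \mathbb{E}X_t)(X_s - \mathbb{E}X_s)\big] = \mathbb{E}(X_t X_s) - \mathbb{E}X_t\,\mathbb{E}X_s - \mathbb{E}X_t\,\mathbb{E}X_s + \mathbb{E}X_t\,\mathbb{E}X_s = \mathbb{E}(X_t X_s) - \mathbb{E}X_t\,\mathbb{E}X_s.$$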


5 In probability theory ergodicity is an important concept which asks
under which conditions the time average of a property is equal to the corresponding ensemble
average, i.e. the average over the entire state space. In particular, ergodicity ensures that the
arithmetic averages over time converge to their theoretical counterparts. In Chap. 4 we allude to
this principle in the estimation of the mean and the autocovariance function of a time series.


Examples of Stochastic Processes

• $\{X_t\}$ is a sequence of independently distributed random variables with values in
$\{-1, 1\}$ such that $P[X_t = 1] = P[X_t = -1] = 1/2$. $X_t$ represents, for example,
the payoff after tossing a coin: if heads occurs one gets a euro whereas if tails
occurs one has to pay a euro.

• The simple random walk $\{S_t\}$ is defined by

$$S_t = S_{t-1} + X_t = \sum_{i=1}^{t} X_i \quad \text{with } t > 0 \text{ and } S_0 = 0,$$

where $\{X_t\}$ is the process from the example just above. In this case $S_t$ is
the proceeds after $t$ rounds of coin tossing. More generally, $\{X_t\}$ could be
any sequence of identically and independently distributed random variables.
Figure 1.7 shows a realization of $\{X_t\}$ for $t = 1, 2, \ldots, 100$ and the corresponding
random walk $\{S_t\}$. For more on random walks see Sect. 1.4.4 and, in particular,
Chap. 7.

• The simple branching process is defined through the recursion

$$X_{t+1} = \sum_{j=1}^{X_t} Z_{t,j} \quad \text{with starting value } X_0 = x_0.$$

In this example $X_t$ represents the size of a population where each member lives
just one period and reproduces itself with some probability. $Z_{t,j}$ thereby denotes
the number of offspring of the $j$-th member of the population in period $t$.
In the simplest case $\{Z_{t,j}\}$ is nonnegative integer valued and identically and
independently distributed. A realization with $X_0 = 100$ and with probabilities
of one third each that the member has no, one, or two offspring is shown as an
example in Fig. 1.8.
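The three processes above are easy to simulate, which is a useful way to get a feeling for them. The sketch below is one possible implementation using only the Python standard library; the function names, the seed, and the number of periods for the branching process are assumptions made for illustration and do not reproduce the realizations shown in the figures.

```python
import random

def coin_tosses(n, rng):
    """n independent payoffs, +1 or -1 with probability 1/2 each."""
    return [rng.choice((-1, 1)) for _ in range(n)]

def random_walk(steps):
    """Partial sums S_t = S_{t-1} + X_t with S_0 = 0; returns [S_0, ..., S_n]."""
    path = [0]
    for x in steps:
        path.append(path[-1] + x)
    return path

def branching_process(x0, periods, rng):
    """Each member has 0, 1, or 2 offspring with probability 1/3 each."""
    sizes = [x0]
    for _ in range(periods):
        sizes.append(sum(rng.choice((0, 1, 2)) for _ in range(sizes[-1])))
    return sizes

rng = random.Random(12345)  # arbitrary seed, fixed only for reproducibility
x = coin_tosses(100, rng)
s = random_walk(x)
pop = branching_process(100, 50, rng)
print(s[-1], pop[-1])
```

Running the code with different seeds gives different trajectories of the same stochastic process, which is exactly the distinction between a process and its realization.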


1.3  Stationarity 

An important insight in time series analysis is that the realizations in different
periods are related to each other. The value of GDP in some year obviously
depends on the values from previous years. This temporal dependence can be
represented either by an explicit model or, in a descriptive way, by covariances,
respectively correlations. Because the realization of $X_t$ in some year $t$ may depend,
in principle, on all past realizations $X_{t-1}, X_{t-2}, \ldots$, we do not have to specify just
a finite number of covariances, but infinitely many covariances. This leads to the
concept of the covariance function. The covariance function is not only a tool for
summarizing the statistical properties of a time series, but is also instrumental in
the derivation of forecasts (Chap. 3), in the estimation of ARMA models, the most
important class of models (Chap. 5), and in the Wold representation (Sect. 3.2 in
Chap. 3). It is therefore of utmost importance to get a thorough understanding of the
meaning and properties of the covariance function.

Fig. 1.7 Realization of a random walk

Fig. 1.8 Realization of a branching process


Definition 1.4 (Autocovariance Function). Let $\{X_t\}$ be a stochastic process with
$\mathbb{V}X_t < \infty$ for all $t \in \mathbb{Z}$. Then the function which assigns to any two time periods
$t$ and $s$, $t, s \in \mathbb{Z}$, the covariance between $X_t$ and $X_s$ is called the autocovariance
function of $\{X_t\}$. The autocovariance function is denoted by $\gamma_X(t, s)$. Formally this
function is given by

$$\gamma_X(t, s) = \operatorname{cov}(X_t, X_s) = \mathbb{E}\left[(X_t - \mathbb{E}X_t)(X_s - \mathbb{E}X_s)\right] = \mathbb{E}X_t X_s - \mathbb{E}X_t\,\mathbb{E}X_s.$$


Remark 1.3. The prefix auto emphasizes that the covariance is computed with
respect to the same variable taken at different points in time. Alternatively, one may
use the term covariance function for short.

Definition 1.5 (Stationarity). A stochastic process $\{X_t\}$ is called stationary if and
only if for all integers $r$, $s$ and $t$ the following properties hold:

(i) $\mathbb{E}X_t = \mu$ constant;

(ii) $\mathbb{V}X_t < \infty$;

(iii) $\gamma_X(t, s) = \gamma_X(t + r, s + r)$.

Remark  1.4.  Processes  with  these  properties  are  often  called  weakly  stationary, 
wide-sense  stationary,  covariance  stationary,  or  second  order  stationary.  As  we  will 
not  deal  with  other  forms  of  stationarity,  we  just  speak  of  stationary  processes,  for 
short. 

Remark 1.5. For $t = s$, we have $\gamma_X(t, s) = \gamma_X(t, t) = \mathbb{V}X_t$ which is nothing but the
unconditional variance of $X_t$. Thus, if $\{X_t\}$ is stationary, $\gamma_X(t, t) = \mathbb{V}X_t$ is constant.
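As a quick check of the definition, consider the coin-tossing process $\{X_t\}$ and the simple random walk $\{S_t\}$ from the examples in Sect. 1.2. The computation below is an illustration added here and uses only the independence of the $X_t$'s:

$$\mathbb{E}X_t = 0, \qquad \gamma_X(t, s) = \begin{cases} 1, & t = s; \\ 0, & t \neq s, \end{cases} \qquad\text{whereas}\qquad \mathbb{E}S_t = 0, \qquad \mathbb{V}S_t = \sum_{i=1}^{t} \mathbb{V}X_i = t.$$

The coin-tossing process therefore satisfies all three conditions and is stationary. The random walk has a constant mean, but its variance grows with $t$, so that it cannot be stationary.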

Remark 1.6. If $\{X_t\}$ is stationary, by setting $r = -s$ the autocovariance function
becomes:

$$\gamma_X(t, s) = \gamma_X(t - s, 0).$$

Thus the covariance $\gamma_X(t, s)$ does not depend on the points in time $t$ and $s$, but only
on the number of periods $t$ and $s$ are apart from each other, i.e. on $t - s$. For
stationary processes it is therefore possible to view the autocovariance function as a
function of just one argument. We denote the autocovariance function in this case by
$\gamma_X(h)$, $h \in \mathbb{Z}$. Because the covariance is symmetric in $t$ and $s$, i.e. $\gamma_X(t, s) = \gamma_X(s, t)$,
we have

$$\gamma_X(h) = \gamma_X(-h) \quad \text{for all integers } h.$$

It is thus sufficient to look at the autocovariance function for nonnegative integers only,
i.e. for $h = 0, 1, 2, \ldots$ In this case we refer to $h$ as the order of the autocovariance.
For $h = 0$, we get the unconditional variance of $X_t$, i.e. $\gamma_X(0) = \mathbb{V}X_t$.
In practice it is more convenient to look at the autocorrelation coefficients instead
of the autocovariances. The autocorrelation function (ACF) for stationary processes
is defined as:

$$\rho_X(h) = \frac{\gamma_X(h)}{\gamma_X(0)} = \operatorname{corr}(X_{t+h}, X_t) \quad \text{for all integers } h$$

where $h$ is referred to as the order. Note that this definition is equivalent to the
ordinary correlation coefficients
$\rho(h) = \frac{\operatorname{cov}(X_t, X_{t-h})}{\sqrt{\mathbb{V}X_t}\,\sqrt{\mathbb{V}X_{t-h}}}$
because stationarity implies
that $\mathbb{V}X_t = \mathbb{V}X_{t-h}$, so that $\sqrt{\mathbb{V}X_t}\,\sqrt{\mathbb{V}X_{t-h}} = \mathbb{V}X_t = \gamma_X(0)$.

Most  of  the  time  it  is  sufficient  to  concentrate  on  the  first  two  moments.  However, 
there  are  situations  where  it  is  necessary  to  look  at  the  whole  distribution.  This  leads 
to  the  concept  of  strict  stationarity. 

Definition 1.6 (Strict Stationarity). A stochastic process is called strictly stationary
if the joint distributions of $(X_{t_1}, \ldots, X_{t_n})$ and $(X_{t_1+h}, \ldots, X_{t_n+h})$ are the same for all
$h \in \mathbb{Z}$ and all $(t_1, \ldots, t_n) \in T^n$, $n = 1, 2, \ldots$

Definition 1.7 (Strict Stationarity). A stochastic process is called strictly stationary
if for all integers $h$ and $n \geq 1$, $(X_1, \ldots, X_n)$ and $(X_{1+h}, \ldots, X_{n+h})$ have the same
distribution.


Remark 1.7. Both definitions are equivalent.


Remark 1.8. If $\{X_t\}$ is strictly stationary then $X_t$ has the same distribution for all
$t$ ($n = 1$). For $n = 2$ we have that $X_{t+h}$ and $X_t$ have a joint distribution which is
independent of $t$. This implies that the covariance, if it exists, depends only on $h$.
Thus, every strictly stationary process with $\mathbb{V}X_t < \infty$ is also stationary.6
The converse is, however, not true as shown by the following example:

$$X_t \sim \begin{cases} \text{exponentially distributed with mean 1 (i.e. } f(x) = e^{-x}\text{)}, & t \text{ odd;} \\ N(1, 1), & t \text{ even;} \end{cases}$$

whereby the $X_t$'s are independently distributed. In this example we have:

• $\mathbb{E}X_t = 1$

• $\gamma_X(0) = 1$ and $\gamma_X(h) = 0$ for $h \neq 0$

Thus $\{X_t\}$ is stationary, but not strictly stationary, because the distribution changes
depending on whether $t$ is even or odd.
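The counterexample can be made tangible by simulation. The following sketch draws from the two distributions involved and compares their first two moments with a feature of the whole distribution, namely the share of negative values. The sample size and the seed are arbitrary choices; the code is an illustration and not part of the formal argument.

```python
import random
import statistics

rng = random.Random(2016)  # arbitrary seed, fixed only for reproducibility
n = 20000

# t odd: exponential with mean 1; t even: normal with mean 1 and variance 1.
odd_draws = [rng.expovariate(1.0) for _ in range(n)]
even_draws = [rng.gauss(1.0, 1.0) for _ in range(n)]

for name, draws in (("odd t", odd_draws), ("even t", even_draws)):
    mean = statistics.fmean(draws)
    var = statistics.pvariance(draws)
    share_negative = sum(d < 0 for d in draws) / n
    print(f"{name}: mean={mean:.3f} var={var:.3f} share below zero={share_negative:.3f}")
```

Both samples have mean and variance close to one, in line with stationarity, but only the normal draws take negative values, so the distributions of $X_t$ for odd and even $t$ clearly differ.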


6 An example of a process which is strictly stationary, but not stationary, is given by the IGARCH
process (see Sect. 8.1.4). This process is strictly stationary with infinite variance.


Definition 1.8 (Gaussian Process). A stochastic process $\{X_t\}$ is called a Gaussian
process if all finite dimensional distributions of $(X_{t_1}, \ldots, X_{t_n})$ with $(t_1, \ldots, t_n) \in T^n$,
$n = 1, 2, \ldots$, are multivariate normally distributed.

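Remark 1.9. A stationary Gaussian process is strictly stationary. For all
$n, h, t_1, \ldots, t_n$, $(X_{t_1}, \ldots, X_{t_n})$ and $(X_{t_1+h}, \ldots, X_{t_n+h})$ have the same mean and the
same covariance matrix, and a multivariate normal distribution is fully determined by
these two moments.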

At this point we will not delve into the relation between stationarity, strict
stationarity and Gaussian processes; some of these issues will be discussed
further in Chap. 8.


1.4 Construction of Stochastic Processes

One important notion in time series analysis is to build up more complicated processes
from simple ones. The simplest building block is a process with zero autocorrelation
called a white noise process which is introduced below. Taking moving-averages
of this process or using it in a recursion gives rise to more sophisticated processes
with more elaborate autocovariance functions. Slutzky (1937) first introduced the
idea that moving-averages of simple processes can generate time series whose
motion resembles business cycle fluctuations.


1.4.1  White  Noise 

The  simplest  building  block  is  a  process  with  zero  autocorrelation  called  a  white 
noise  process. 

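Definition 1.9 (White Noise). A stationary process $\{Z_t\}$ is called a white noise
process if $\{Z_t\}$ satisfies:

• $\mathbb{E}Z_t = 0$

• $\gamma_Z(h) = \begin{cases} \sigma^2, & h = 0; \\ 0, & h \neq 0. \end{cases}$

We denote this by $Z_t \sim \mathrm{WN}(0, \sigma^2)$.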

The white noise process is therefore stationary and temporally uncorrelated,
i.e. the ACF is always equal to zero, except for $h = 0$ where it is equal to one.
As the ACF possesses no structure, it is impossible to draw inferences from past
observations to its future development, at least in a least-squares setting with linear
forecasting functions (see Chap. 3). Therefore one can say that a white noise process
has no memory.

If $\{Z_t\}$ is not only temporally uncorrelated, but also independently and identically
distributed, we write $Z_t \sim \mathrm{IID}(0, \sigma^2)$. If in addition $Z_t$ is normally distributed, we
write $Z_t \sim \mathrm{IIN}(0, \sigma^2)$. An $\mathrm{IID}(0, \sigma^2)$ process is always a white noise process. The
converse is, however, not true as will be shown in Chap. 8.


1.4.2 Construction of Stochastic Processes: Some Examples

We will now illustrate how complex stationary processes can be constructed by
manipulating a white noise process. In Table 1.1 we report in column 2 the
first 6 realizations of a white noise process $\{Z_t\}$. Figure 1.9a plots the first 100
observations. We can now construct a new process $\{X_t^{(MA)}\}$ by taking moving-averages
over adjacent periods. More specifically, we take $X_t^{(MA)} = Z_t + 0.9 Z_{t-1}$,
$t = 2, 3, \ldots$ Thus, the realization of $\{X_t^{(MA)}\}$ in period 2 is $x_2^{(MA)} = -0.8718 +
0.9 \times 0.2590 = -0.6387$.$^7$ The realization in period 3 is $x_3^{(MA)} = -0.7879 +
0.9 \times (-0.8718) = -1.5726$, and so on. The resulting realizations of $\{X_t^{(MA)}\}$ for
$t = 2, \ldots, 6$ are reported in the third column of Table 1.1 and the plot is shown in
Fig. 1.9b. One can see that the averaging makes the series smoother. In Sect. 1.4.3
we will provide a more detailed analysis of this moving-average process.

Another construction device is a recursion: $X_t^{(AR)} = \phi X_{t-1}^{(AR)} + Z_t$, $t = 2, 3, \ldots$,
with starting value $X_1^{(AR)} = Z_1$. Such a process is called autoregressive because it
refers to its own past. Taking $\phi = 0.9$, the realization of $\{X_t^{(AR)}\}$ in period 2 is
$x_2^{(AR)} = -0.6387 = 0.9 \times 0.2590 - 0.8718$, in period 3 $x_3^{(AR)} = -1.3627 =
0.9 \times (-0.6387) - 0.7879$, and so on. Again the resulting realizations of $\{X_t^{(AR)}\}$ for
$t = 2, \ldots, 6$ are reported in the fourth column of Table 1.1 and the plot is shown in
Fig. 1.9c. One can see how the series becomes more persistent. In Sect. 2.2.2 we will
provide a more detailed analysis of this autoregressive process.

Finally, we construct a new process by taking cumulative sums: $X_t^{(RW)} =
\sum_{\tau=1}^{t} Z_\tau$. This process can also be obtained from the recursion above by taking
$\phi = 1$ so that $X_t^{(RW)} = X_{t-1}^{(RW)} + Z_t$. It is called a random walk. Thus, the realization
of $\{X_t^{(RW)}\}$ for period 2 is $x_2^{(RW)} = -0.6128 = 0.2590 - 0.8718$, for period 3
$x_3^{(RW)} = -1.4007 = -0.6128 - 0.7879$, and so on. Again the resulting realizations
of $\{X_t^{(RW)}\}$ for $t = 2, \ldots, 6$ are reported in the last column of Table 1.1 and the plot
is shown in Fig. 1.9d. One can see how the series moves away from its mean of zero
more persistently than the other three processes considered. In Sect. 1.4.4 we will
provide a more detailed analysis of this so-called random walk process and show
that it is not stationary.
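The three constructions can be retraced in a few lines of code. The following sketch takes the six white noise realizations from Table 1.1 as given and applies the moving-average, the autoregressive recursion, and the cumulative sum with $Z_0 = X_0 = 0$; the variable names are illustrative. Because the input is already rounded to four digits, the results may differ from the table in the last digit.

```python
# First six white noise realizations, copied from column 2 of Table 1.1.
z = [0.2590, -0.8718, -0.7879, -0.3443, 0.6476, 2.0541]
theta = phi = 0.9

ma, ar, rw = [], [], []
z_prev = x_ar = x_rw = 0.0           # Z_0 = X_0 = 0, as assumed in Table 1.1
for z_t in z:
    ma.append(z_t + theta * z_prev)  # moving-average over adjacent periods
    x_ar = phi * x_ar + z_t          # autoregressive recursion
    x_rw = x_rw + z_t                # cumulative sum (recursion with phi = 1)
    ar.append(x_ar)
    rw.append(x_rw)
    z_prev = z_t

# One row per period: time, white noise, MA, AR, random walk.
for t, row in enumerate(zip(z, ma, ar, rw), start=1):
    print(t, *(f"{v:8.4f}" for v in row))
```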


$^7$ The following calculations are subject to rounding to four digits.

Table 1.1 Construction of stochastic processes assuming $Z_0 = X_0 = 0$

Time   White noise   Moving-average   Autoregressive   Random walk
1         0.2590         0.2590           0.2590          0.2590
2        -0.8718        -0.6387          -0.6387         -0.6128
3        -0.7879        -1.5726          -1.3627         -1.4007
4        -0.3443        -1.0535          -1.5708         -1.7451
5         0.6476         0.3377          -0.7661         -1.0974
6         2.0541         2.6370           1.3646          0.9567

Fig. 1.9 Processes constructed from a given white noise process. (a) White noise. (b) Moving-
average with $\theta = 0.9$. (c) Autoregressive with $\phi = 0.9$. (d) Random walk


1.4.3 Moving-Average Process of Order One

The white noise process can be used as a building block to construct more complex
processes with a more involved autocorrelation structure. The simplest procedure
is to take moving averages over consecutive periods.$^8$ This leads to the moving-
average processes. The moving-average process of order one, MA(1) process, is
defined as

$$X_t = Z_t + \theta Z_{t-1} \quad \text{with } Z_t \sim \mathrm{WN}(0, \sigma^2).$$

Clearly, $\mathbb{E}X_t = \mathbb{E}Z_t + \theta\,\mathbb{E}Z_{t-1} = 0$. The mean is therefore constant and equal to
zero.

$^8$ This procedure is an example of a filter. Section 6.4 provides a general introduction to filters.

The  autocovariance  function  can  be  computed  as  follows: 

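$$\gamma_X(t+h, t) = \mathrm{cov}(X_{t+h}, X_t)$$

$$= \mathrm{cov}(Z_{t+h} + \theta Z_{t+h-1}, Z_t + \theta Z_{t-1})$$

$$= \mathbb{E}Z_{t+h}Z_t + \theta\,\mathbb{E}Z_{t+h}Z_{t-1} + \theta\,\mathbb{E}Z_{t+h-1}Z_t + \theta^2\,\mathbb{E}Z_{t+h-1}Z_{t-1}.$$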

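Recalling that $\{Z_t\}$ is white noise so that $\mathbb{E}Z_t^2 = \sigma^2$ and $\mathbb{E}Z_tZ_{t+h} = 0$ for $h \neq 0$, we
therefore get the following autocovariance function of $\{X_t\}$:

$$\gamma_X(h) = \begin{cases} (1 + \theta^2)\sigma^2, & h = 0; \\ \theta\sigma^2, & h = \pm 1; \\ 0, & \text{otherwise.} \end{cases} \tag{1.1}$$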

Thus $\{X_t\}$ is stationary irrespective of the value of $\theta$. The autocorrelation function is:

$$\rho_X(h) = \begin{cases} 1, & h = 0; \\ \dfrac{\theta}{1 + \theta^2}, & h = \pm 1; \\ 0, & \text{otherwise.} \end{cases}$$

Note that the newly created process now exhibits a dependence on its past as
$X_t$ is correlated with $X_{t-1}$. The absolute value of this correlation is restricted to the
interval $[0, \frac{1}{2}]$, i.e. $0 \leq |\rho_X(1)| \leq \frac{1}{2}$. As the correlation between $X_t$ and $X_s$ is zero
when $t$ and $s$ are more than one period apart, we call a moving-average process a
process with finite memory or a process with finite-range dependence.
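As a small illustration, the autocovariance and autocorrelation functions of the MA(1) process can be coded directly from Eq. (1.1). The function names and the grid of values for $\theta$ are illustrative assumptions. Evaluating $\rho_X(1) = \theta/(1+\theta^2)$ over the grid shows that the first order autocorrelation never exceeds one half in absolute value.

```python
def ma1_acvf(theta, sigma2, h):
    """Autocovariance gamma_X(h) of X_t = Z_t + theta * Z_{t-1}, Eq. (1.1)."""
    if h == 0:
        return (1 + theta ** 2) * sigma2
    if abs(h) == 1:
        return theta * sigma2
    return 0.0


def ma1_acf(theta, h):
    """Autocorrelation rho_X(h); sigma^2 cancels in the ratio."""
    return ma1_acvf(theta, 1.0, h) / ma1_acvf(theta, 1.0, 0)


# rho_X(1) = theta / (1 + theta^2) on a grid of theta values in [-5, 5].
grid = [k / 10 for k in range(-50, 51)]
rho1 = [ma1_acf(th, 1) for th in grid]

# The extremes are attained at theta = 1 and theta = -1.
print(max(rho1), min(rho1))
```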

Remark 1.10. To motivate the name moving-average, we can define the MA(1)
process more generally as

$$X_t = \theta_0 Z_t + \theta_1 Z_{t-1} \quad \text{with } Z_t \sim \mathrm{WN}(0, \sigma^2) \text{ and } \theta_0 \neq 0.$$

Thus, $X_t$ is a weighted average of $Z_t$ and $Z_{t-1}$. If $\theta_0 = \theta_1 = 1/2$, $X_t$ is just
the arithmetic mean of $Z_t$ and $Z_{t-1}$. This process is, however, (observationally)
equivalent to the process

$$X_t = \tilde{Z}_t + \tilde{\theta}\tilde{Z}_{t-1} \quad \text{with } \tilde{Z}_t \sim \mathrm{WN}(0, \tilde{\sigma}^2)$$

where $\tilde{\theta} = \theta_1/\theta_0$ and $\tilde{\sigma}^2 = \theta_0^2\sigma^2$. Both processes would generate the same first two
moments and are therefore observationally indistinguishable from each other. Thus,
we can set $\theta_0 = 1$ without loss of generality.


1.4.4  Random  Walk 

Let $Z_t \sim \mathrm{WN}(0, \sigma^2)$ be a white noise process; then the new process $\{X_t\}$ defined as

$$X_t = Z_1 + Z_2 + \ldots + Z_t = \sum_{j=1}^{t} Z_j, \quad t > 0, \tag{1.2}$$

is called a random walk. Note that, in contrast to $\{Z_t\}$, $\{X_t\}$ is only defined for $t > 0$.
The random walk may alternatively be defined through the recursion

$$X_t = X_{t-1} + Z_t, \quad t > 0 \text{ and } X_0 = 0.$$

If in each time period a constant $\delta$ is added such that

$$X_t = \delta + X_{t-1} + Z_t,$$

the process $\{X_t\}$ is called a random walk with drift.

Although  the  random  walk  has  a  constant  mean  of  zero,  it  is  a  nonstationary 
process. 

Proposition 1.1. The random walk $\{X_t\}$ as defined in Eq. (1.2) is nonstationary.

Proof. The variance of $X_{t+1} - X_1$ equals $\mathbb{V}(X_{t+1} - X_1) = \mathbb{V}\left(\sum_{j=2}^{t+1} Z_j\right) =
\sum_{j=2}^{t+1} \mathbb{V}Z_j = t\sigma^2$.

Assume for the moment that $\{X_t\}$ is stationary; then the triangular inequality
implies for $t > 0$:

$$0 < \sqrt{t\sigma^2} = \mathrm{std}(X_{t+1} - X_1) \leq \mathrm{std}(X_{t+1}) + \mathrm{std}(X_1) = 2\,\mathrm{std}(X_1)$$

where "std" denotes the standard deviation. As the left hand side of the inequality
converges to infinity for $t$ going to infinity, also the right hand side must go
to infinity. This means that the variance of $X_1$ must be infinite. This, however,
contradicts the assumption of stationarity. Thus $\{X_t\}$ cannot be stationary. □
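The growth of the variance can also be made visible by simulation. The sketch below assumes Gaussian white noise with $\sigma = 1$ and uses an arbitrary number of paths, horizon, and seed; it estimates the variance of $X_t$ across many simulated random walks and compares it with $t\sigma^2$, the variance of a sum of $t$ uncorrelated shocks.

```python
import random
import statistics

random.seed(1)  # fixed seed so that the run is reproducible
sigma = 1.0
n_paths, horizon = 5_000, 50


def random_walk(horizon):
    """One path X_1, ..., X_horizon of X_t = X_{t-1} + Z_t with X_0 = 0."""
    x, path = 0.0, []
    for _ in range(horizon):
        x += random.gauss(0.0, sigma)
        path.append(x)
    return path


paths = [random_walk(horizon) for _ in range(n_paths)]

# Cross-sectional variance of X_t over the simulated paths for selected t.
est = {t: statistics.pvariance([p[t - 1] for p in paths]) for t in (1, 10, 25, 50)}
for t, var_t in est.items():
    print(t, round(var_t, 2), "theory:", t * sigma ** 2)
```

The estimated variance increases roughly linearly in $t$, which is incompatible with the constant variance required by stationarity.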

The random walk represents by far the most widely used nonstationary process
in economics. It has proven to be an important ingredient in many economic
time series. Typical nonstationary time series which are or are driven by random
walks are stock market prices, exchange rates, or the gross domestic product
(GDP). Usually it is necessary to apply some transformation (filter) first to achieve
stationarity. In the example above, one has to replace $\{X_t\}$ by its first difference
$\{\Delta X_t\} = \{X_t - X_{t-1}\} = \{Z_t\}$ which is stationary by construction. Time series
which become stationary after differencing are called integrated processes and are
the subject of a more in-depth analysis in Chap. 7. Besides ordinary differencing,
other  transformations  are  often  encountered:  seasonal  differencing,  inclusion  of  a 
time  trend,  seasonal  dummies,  moving  averages,  etc.  Some  of  them  will  be  discussed 
as  we  go  along. 

1.4.5  Changing  Mean 

Finally,  here  is  another  simple  example  of  a  nonstationary  process. 


$$X_t = \begin{cases} Y_t, & t < t_c; \\ Y_t + c, & t \geq t_c \text{ and } c \neq 0 \end{cases}$$


where $t_c$ is some specific point in time. $\{X_t\}$ is clearly not stationary because the
mean is not constant. In econometrics we refer to such a situation as a structural
change which can be accommodated by introducing a so-called dummy variable.
Models with more sophisticated forms of structural changes will be discussed in
Chap. 18.


1.5 Properties of the Autocovariance Function

The  autocovariance  function  represents  the  directly  accessible  external  properties 
of  the  time  series.  It  is  therefore  important  to  understand  its  properties  and  how 
it  is  related  to  its  inner  structure.  We  will  deepen  the  connection  between  the 
autocovariance  function  and  a  particular  class  of  models  in  Chap.  2.  The  estimation 
of  the  autocovariance  function  will  be  treated  in  Chap.  4.  For  the  moment  we  will 
just  give  its  properties  and  analyze  the  case  of  the  MA(1)  model  as  a  prototypical 
example. 

Theorem 1.1. The autocovariance function of a stationary process $\{X_t\}$ is
characterized by the following properties:

(i) $\gamma_X(0) \geq 0$;

(ii) $0 \leq |\gamma_X(h)| \leq \gamma_X(0)$;

(iii) $\gamma_X(h) = \gamma_X(-h)$;

(iv) $\sum_{i,j=1}^{n} a_i \gamma_X(t_i - t_j) a_j \geq 0$ for all $n$ and all vectors $(a_1, \ldots, a_n)'$ and $(t_1, \ldots, t_n)$.
This property is called non-negative definiteness.


Proof. The first property is obvious as the variance is always nonnegative. The
second property follows from the Cauchy-Bunyakovskii-Schwarz inequality (see
Theorem C.1) applied to $X_t$ and $X_{t+h}$, which yields $0 \leq |\gamma_X(h)| \leq \gamma_X(0)$. The third
property follows immediately from the definition of the covariance. Define $a =
(a_1, \ldots, a_n)'$ and $X = (X_{t_1}, \ldots, X_{t_n})'$; then the last property follows from the fact that
the variance is always nonnegative: $0 \leq \mathbb{V}(a'X) = a'\mathbb{V}(X)a = \sum_{i,j=1}^{n} a_i \gamma_X(t_i - t_j) a_j$.

□
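Properties (ii) to (iv) can be checked numerically for a concrete case. The sketch below uses the autocovariance function of an MA(1) process from Eq. (1.1) with the illustrative values $\theta = 0.9$ and $\sigma^2 = 1$ and evaluates the quadratic form of property (iv) for randomly drawn weight vectors; the helper names and the number of draws are assumptions made for the illustration. A finite number of draws can of course only illustrate, not prove, non-negative definiteness.

```python
import random


def acvf_ma1(h, theta=0.9, sigma2=1.0):
    """Autocovariance of an MA(1) process, Eq. (1.1)."""
    if h == 0:
        return (1 + theta ** 2) * sigma2
    return theta * sigma2 if abs(h) == 1 else 0.0


def quadratic_form(a, times, acvf):
    """sum_{i,j} a_i * acvf(t_i - t_j) * a_j, as in property (iv)."""
    return sum(ai * acvf(ti - tj) * aj
               for ai, ti in zip(a, times)
               for aj, tj in zip(a, times))


random.seed(2)
times = list(range(1, 9))
forms = []
for _ in range(1000):
    a = [random.uniform(-1, 1) for _ in times]   # random weight vector
    forms.append(quadratic_form(a, times, acvf_ma1))

lags = range(-3, 4)
print(all(abs(acvf_ma1(h)) <= acvf_ma1(0) for h in lags))   # property (ii)
print(all(acvf_ma1(h) == acvf_ma1(-h) for h in lags))       # property (iii)
print(min(forms) >= 0)                                      # property (iv)
```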

Similar properties hold for the correlation function $\rho_X$, except that we have
$\rho_X(0) = 1$.

Theorem 1.2. The autocorrelation function of a stationary stochastic process $\{X_t\}$
is characterized by the following properties:

(i) $\rho_X(0) = 1$;

(ii) $0 \leq |\rho_X(h)| \leq 1$;

(iii) $\rho_X(h) = \rho_X(-h)$;

(iv) $\sum_{i,j=1}^{n} a_i \rho_X(t_i - t_j) a_j \geq 0$ for all $n$ and all vectors $(a_1, \ldots, a_n)'$ and $(t_1, \ldots, t_n)$.

Proof. The proof follows immediately from the properties of the autocovariance
function. □

It  can  be  shown  that  for  any  given  function  with  the  above  properties  there  exists 
a  stationary  process  (Gaussian  process)  which  has  this  function  as  its  autocovariance 
function,  respectively  autocorrelation  function. 


1.5.1  Autocovariance  Function  of  MA(1)  Processes 

The  autocovariance  function  describes  the  external  observable  characteristics  of  a 
time  series  which  can  be  estimated  from  the  data.  Usually,  we  want  to  understand 
the  internal  mechanism  which  generates  the  data  at  hand.  For  this  we  need  a  model. 
Hence  it  is  important  to  understand  the  relation  between  the  autocovariance  function 
and  a  certain  class  of  models.  In  this  section,  by  analyzing  the  MA(1)  model,  we 
will  show  that  this  relationship  is  not  one-to-one.  Thus  we  are  confronted  with  a 
fundamental  identification  problem. 

In  order  to  make  the  point,  consider  the  following  given  autocovariance  function: 

$$\gamma(h) = \begin{cases} \gamma_0, & h = 0; \\ \gamma_1, & h = \pm 1; \\ 0, & |h| > 1. \end{cases}$$

The problem consists of determining the parameters of the MA(1) model, $\theta$ and
$\sigma^2$, from the values of the autocovariance function. For this purpose we equate
$\gamma_0 = (1 + \theta^2)\sigma^2$ and $\gamma_1 = \theta\sigma^2$ (see Eq. (1.1)). This leads to an equation system in
the two unknowns $\theta$ and $\sigma^2$. This system can be simplified by dividing the second
equation by the first one to obtain: $\gamma_1/\gamma_0 = \theta/(1 + \theta^2)$. Because $\gamma_1/\gamma_0 = \rho(1) = \rho_1$
one gets a quadratic equation in $\theta$:

$$\rho_1\theta^2 - \theta + \rho_1 = 0.$$

The two solutions of this equation are

$$\theta = \frac{1}{2\rho_1}\left(1 \pm \sqrt{1 - 4\rho_1^2}\right).$$

The solutions are real if and only if the discriminant $1 - 4\rho_1^2$ is nonnegative. This is the
case if and only if $\rho_1^2 \leq 1/4$, respectively $|\rho_1| \leq 1/2$. Note that one root is the
inverse of the other. The identification problem thus takes the following form:

$|\rho_1| < 1/2$: there exist two observationally equivalent MA(1) processes
corresponding to the two solutions $\theta_1$ and $\theta_2$.

$\rho_1 = \pm 1/2$: there exists exactly one MA(1) process with $\theta = \pm 1$.

$|\rho_1| > 1/2$: there exists no MA(1) process with this autocovariance function.

The relation between the first order autocorrelation coefficient, $\rho_1 = \rho(1)$, and the
parameter $\theta$ of the MA(1) process is represented in Fig. 1.10. As can be seen,
there exist for each $\rho(1)$ with $|\rho(1)| < \frac{1}{2}$ two solutions. The two solutions are
inverses of each other. Hence one solution is absolutely smaller than one whereas the
other is bigger than one. In Sect. 2.3 we will argue in favor of the solution smaller
than one. For $\rho(1) = \pm 1/2$ there exists exactly one solution, namely $\theta = \pm 1$.
For $|\rho(1)| > 1/2$ there is no solution. For $|\rho_1| > 1/2$, $\rho(h)$ actually does not
represent a genuine autocorrelation function as the fourth condition in Theorem 1.1,
respectively Theorem 1.2 is violated. For $\rho_1 > \frac{1}{2}$ set $a = (1, -1, 1, -1, \ldots, 1, -1)'$
to get:

$$\sum_{i,j=1}^{n} a_i \rho(i - j) a_j = n - 2(n - 1)\rho_1 < 0, \quad \text{if } n > \frac{2\rho_1}{2\rho_1 - 1}.$$

For $\rho_1 < -\frac{1}{2}$ one sets $a = (1, 1, \ldots, 1)'$. Hence the fourth property is violated.
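The identification problem can be retraced numerically. The following sketch solves the quadratic equation for given values $\gamma_0$ and $\gamma_1$ and returns all admissible pairs $(\theta, \sigma^2)$; it also evaluates the quadratic form for the alternating vector $a$ used above. The function names and the numerical values ($\theta = 0.5$, $\sigma^2 = 1$, respectively $\rho_1 = 0.6$) are illustrative assumptions, and the solver presupposes $\gamma_0 > 0$ and $\gamma_1 \neq 0$.

```python
import math


def ma1_from_acvf(gamma0, gamma1):
    """Return all (theta, sigma2) pairs of an MA(1) matching gamma0 and gamma1.

    Solves rho1*theta^2 - theta + rho1 = 0 with rho1 = gamma1/gamma0 and
    recovers sigma2 from gamma1 = theta*sigma2.
    Assumes gamma0 > 0 and gamma1 != 0.
    """
    rho1 = gamma1 / gamma0
    disc = 1 - 4 * rho1 ** 2
    if disc < 0:
        return []                       # |rho1| > 1/2: no MA(1) process exists
    roots = {(1 + s * math.sqrt(disc)) / (2 * rho1) for s in (1, -1)}
    return sorted((theta, gamma1 / theta) for theta in roots)


def alternating_form(rho1, n):
    """sum_{i,j} a_i rho(i-j) a_j for a = (1, -1, 1, ...)', rho(h) = 0 for |h| > 1."""
    a = [(-1) ** i for i in range(n)]

    def rho(h):
        return 1.0 if h == 0 else (rho1 if abs(h) == 1 else 0.0)

    return sum(a[i] * rho(i - j) * a[j] for i in range(n) for j in range(n))


# Autocovariances generated by theta = 0.5 and sigma2 = 1.
gamma0, gamma1 = 1 + 0.5 ** 2, 0.5
print(ma1_from_acvf(gamma0, gamma1))   # two solutions with reciprocal thetas
print(ma1_from_acvf(2.0, 1.0))         # rho1 = 1/2: a single solution
print(ma1_from_acvf(1.0, 0.6))         # rho1 > 1/2: no solution
print(alternating_form(0.6, 8))        # negative, equals n - 2(n - 1)rho1
```

Both solutions in the first case reproduce the same $\gamma_0$ and $\gamma_1$, which is exactly the observational equivalence described in the text.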


1.6 Exercises

Exercise 1.6.1. Let the process $\{X_t\}$ be generated by a two-sided moving-average
process

$$X_t = 0.5 Z_{t+1} + 0.5 Z_{t-1} \quad \text{with } Z_t \sim \mathrm{WN}(0, \sigma^2).$$

Determine the autocovariance and the autocorrelation function of $\{X_t\}$.

Fig. 1.10 Relation between the autocorrelation coefficient of order one, $\rho(1)$, and the parameter
$\theta$ of an MA(1) process (axis label: first order "moving average" parameter $\theta$)

Exercise 1.6.2. Let $\{X_t\}$ be the MA(1) process

$$X_t = Z_t + \theta Z_{t-2} \quad \text{with } Z_t \sim \mathrm{WN}(0, \sigma^2).$$

(i) Determine the autocovariance and the autocorrelation function of $\{X_t\}$ for
$\theta = 0.9$.

(ii) Determine the variance of the mean $(X_1 + X_2 + X_3 + X_4)/4$.

(iii) How do the previous results change if $\theta = -0.9$?

Exercise 1.6.3. Consider the autocovariance function

$$\gamma(h) = \begin{cases} 4, & h = 0; \\ -2, & h = \pm 1; \\ 0, & \text{otherwise.} \end{cases}$$

Determine the parameters $\theta$ and $\sigma^2$, if they exist, of the first order moving-average
process $X_t = Z_t + \theta Z_{t-1}$ with $Z_t \sim \mathrm{WN}(0, \sigma^2)$ such that the autocovariance function
above is the autocovariance function corresponding to $\{X_t\}$.

Exercise 1.6.4. Let the stochastic process $\{X_t\}$ be defined as

$$X_t = \begin{cases} Z_t, & \text{if } t \text{ is even;} \\ (Z_{t-1}^2 - 1)/\sqrt{2}, & \text{if } t \text{ is odd,} \end{cases}$$

where $\{Z_t\}$ is identically and independently distributed as $Z_t \sim N(0, 1)$. Show that
$\{X_t\} \sim \mathrm{WN}(0, 1)$, but not $\mathrm{IID}(0, 1)$.

Exercise  1.6.5.  Which  of  the  following  processes  is  stationary? 

(i) $X_t = Z_t + \theta Z_{t-1}$

(ii) $X_t = Z_t Z_{t-1}$

(iii) $X_t = a + \theta Z_0$

(iv) $X_t = Z_0 \sin(at)$

In all cases we assume that $\{Z_t\}$ is identically and independently distributed with
$Z_t \sim N(0, \sigma^2)$. $\theta$ and $a$ are arbitrary parameters.


2 Autoregressive Moving-Average Models


A  basic  idea  in  time  series  analysis  is  to  construct  more  complex  processes  from 
simple  ones.  In  the  previous  chapter  we  showed  how  the  averaging  of  a  white 
noise  process  leads  to  a  process  with  first  order  autocorrelation.  In  this  chapter  we 
generalize  this  idea  and  consider  processes  which  are  solutions  of  linear  stochastic 
difference  equations.  These  so-called  ARMA  processes  constitute  the  most  widely 
used  class  of  models  for  stationary  processes. 

Definition 2.1 (ARMA Models). A stochastic process $\{X_t\}$ with $t \in \mathbb{Z}$ is called
an autoregressive moving-average process (ARMA process) of order $(p, q)$, denoted
by ARMA$(p, q)$ process, if the process is stationary and satisfies a linear stochastic
difference equation of the form

$$X_t - \phi_1 X_{t-1} - \ldots - \phi_p X_{t-p} = Z_t + \theta_1 Z_{t-1} + \ldots + \theta_q Z_{t-q} \tag{2.1}$$

with $Z_t \sim \mathrm{WN}(0, \sigma^2)$ and $\phi_p\theta_q \neq 0$. $\{X_t\}$ is called an ARMA$(p, q)$ process with
mean $\mu$ if $\{X_t - \mu\}$ is an ARMA$(p, q)$ process.

The importance of ARMA processes is due to the fact that every stationary
process can be approximated arbitrarily well by an ARMA process. In particular,
it can be shown that for any given autocovariance function $\gamma$ with the property
$\lim_{h \to \infty} \gamma(h) = 0$ and any positive integer $k$ there exists an autoregressive moving-
average process (ARMA process) $\{X_t\}$ such that $\gamma_X(h) = \gamma(h)$, $h = 0, 1, \ldots, k$.

For an ARMA process with mean $\mu$ one often adds a constant $c$ to the right hand
side of the difference equation:

$$X_t - \phi_1 X_{t-1} - \ldots - \phi_p X_{t-p} = c + Z_t + \theta_1 Z_{t-1} + \ldots + \theta_q Z_{t-q}.$$

The mean of $X_t$ is then: $\mu = \frac{c}{1 - \phi_1 - \ldots - \phi_p}$. The mean is therefore only well-defined if
$\phi_1 + \ldots + \phi_p \neq 1$. The case $\phi_1 + \ldots + \phi_p = 1$ can, however, be excluded because
there exists no stationary solution in this case (see Remark 2.2) and thus no ARMA
process.

2.1 The Lag Operator

In time series analysis it is customary to rewrite the above difference equation more
compactly in terms of the lag operator L. This is, however, not only a compact
notation, but will open the way to analyze the inner structure of ARMA processes.
The lag or back-shift operator L moves the time index one period back:

$$\mathrm{L}\{X_t\} = \{X_{t-1}\}.$$

For ease of notation we write: $\mathrm{L}X_t = X_{t-1}$. The lag operator is a linear operator with
the following calculation rules:

(i) L applied to the process $\{X_t = c\}$ where $c$ is an arbitrary constant gives:

$$\mathrm{L}c = c.$$

(ii) Applying L $n$ times:

$$\underbrace{\mathrm{L} \cdots \mathrm{L}}_{n \text{ times}} X_t = \mathrm{L}^n X_t = X_{t-n}.$$

(iii) The inverse of the lag operator is the lead or forward operator. This operator
shifts the time index one period into the future.$^1$ We can write $\mathrm{L}^{-1}$:

$$\mathrm{L}^{-1}X_t = X_{t+1}.$$

(iv) For any integers $m$ and $n$ we have:

$$\mathrm{L}^m\mathrm{L}^n X_t = \mathrm{L}^{m+n}X_t = X_{t-m-n}.$$

(v) As $\mathrm{L}^{-1}\mathrm{L}X_t = X_t$ we have that

$$\mathrm{L}^0 = 1.$$

(vi) For any real numbers $a$ and $b$, any integers $m$ and $n$, and arbitrary stochastic
processes $\{X_t\}$ and $\{Y_t\}$ we have:

$$(a\mathrm{L}^m + b\mathrm{L}^n)(X_t + Y_t) = aX_{t-m} + bX_{t-n} + aY_{t-m} + bY_{t-n}.$$

In this way it is possible to define lag polynomials: $A(\mathrm{L}) = a_0 + a_1\mathrm{L} + a_2\mathrm{L}^2 + \ldots +
a_p\mathrm{L}^p$ where $a_0, a_1, \ldots, a_p$ are any real numbers. For these polynomials the usual
calculation rules apply. Let, for example, $A(\mathrm{L}) = 1 - 0.5\mathrm{L}$ and $B(\mathrm{L}) = 1 + 4\mathrm{L}^2$;
then $C(\mathrm{L}) = A(\mathrm{L})B(\mathrm{L}) = 1 - 0.5\mathrm{L} + 4\mathrm{L}^2 - 2\mathrm{L}^3$.

$^1$ One technical advantage of using the double-infinite index set $\mathbb{Z}$ is that the lag operators form a
group.
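The multiplication of lag polynomials amounts to a convolution of their coefficient sequences. The sketch below represents a lag polynomial by the list of its coefficients, with position $k$ holding the coefficient of $\mathrm{L}^k$, and reproduces the example above; the function names and the short test series are illustrative assumptions.

```python
def multiply_lag_polynomials(a, b):
    """Coefficients of A(L)B(L); a[k] is the coefficient of L**k."""
    c = [0.0] * (len(a) + len(b) - 1)
    for i, ai in enumerate(a):
        for j, bj in enumerate(b):
            c[i + j] += ai * bj      # L**i times L**j equals L**(i + j)
    return c


def apply_lag_polynomial(a, x, t):
    """Evaluate (A(L)x)_t = sum_k a[k] * x[t - k]; requires t >= len(a) - 1."""
    return sum(ak * x[t - k] for k, ak in enumerate(a))


A = [1.0, -0.5]          # A(L) = 1 - 0.5 L
B = [1.0, 0.0, 4.0]      # B(L) = 1 + 4 L^2
C = multiply_lag_polynomials(A, B)
print(C)                 # [1.0, -0.5, 4.0, -2.0]

# Applying C(L) directly gives the same result as applying B(L) and then A(L).
x = [1.0, 2.0, 0.0, -1.0, 3.0, 5.0]
bx = [0.0, 0.0] + [apply_lag_polynomial(B, x, t) for t in range(2, len(x))]
print(apply_lag_polynomial(C, x, 5), apply_lag_polynomial(A, bx, 5))
```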

Applied to the stochastic difference equation, we define the autoregressive and
the moving-average polynomial as follows:

$$\Phi(\mathrm{L}) = 1 - \phi_1\mathrm{L} - \ldots - \phi_p\mathrm{L}^p,$$

$$\Theta(\mathrm{L}) = 1 + \theta_1\mathrm{L} + \ldots + \theta_q\mathrm{L}^q.$$

The stochastic difference equation defining the ARMA process can then be written
compactly as

$$\Phi(\mathrm{L})X_t = \Theta(\mathrm{L})Z_t.$$

Thus, the use of lag polynomials provides a compact notation for ARMA processes.
Moreover and most importantly, $\Phi(z)$ and $\Theta(z)$, viewed as polynomials of the
complex number $z$, also reveal much of their inherent structural properties as will
become clear in Sect. 2.3.


2.2  Some  Important  Special  Cases 

Before  we  deal  with  the  general  theory  of  ARMA  processes,  we  will  analyze  some 
important  special  cases  first: 

$q = 0$: autoregressive process of order $p$, AR($p$) process

$p = 0$: moving-average process of order $q$, MA($q$) process

2.2.1 The Moving-Average Process of Order q (MA(q) Process)

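The MA($q$) process is defined by the following stochastic difference equation:

$$X_t = \Theta(\mathrm{L})Z_t = \theta_0 Z_t + \theta_1 Z_{t-1} + \ldots + \theta_q Z_{t-q} \quad \text{with } \theta_0 = 1 \text{ and } \theta_q \neq 0$$

and $Z_t \sim \mathrm{WN}(0, \sigma^2)$. Obviously,

$$\mathbb{E}X_t = \mathbb{E}Z_t + \theta_1\mathbb{E}Z_{t-1} + \ldots + \theta_q\mathbb{E}Z_{t-q} = 0,$$

because $Z_t \sim \mathrm{WN}(0, \sigma^2)$. As can be easily verified using the properties of $\{Z_t\}$, the
autocovariance function of the MA($q$) process is:

$$\gamma_X(h) = \mathrm{cov}(X_{t+h}, X_t) = \mathbb{E}(X_{t+h}X_t)$$

$$= \mathbb{E}(Z_{t+h} + \theta_1 Z_{t+h-1} + \ldots + \theta_q Z_{t+h-q})(Z_t + \theta_1 Z_{t-1} + \ldots + \theta_q Z_{t-q})$$

$$= \begin{cases} \sigma^2 \sum_{i=0}^{q-|h|} \theta_i\theta_{i+|h|}, & |h| \leq q; \\ 0, & |h| > q. \end{cases}$$

Fig. 2.1 Realization and estimated ACF of an MA(1) process: $X_t = Z_t - 0.8Z_{t-1}$ with $Z_t \sim
\mathrm{IIDN}(0, 1)$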


This  implies  the  following  autocorrelation  function: 


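$$\rho_X(h) = \mathrm{corr}(X_{t+h}, X_t) = \begin{cases} \dfrac{1}{\sum_{i=0}^{q} \theta_i^2} \sum_{i=0}^{q-|h|} \theta_i\theta_{i+|h|}, & |h| \leq q; \\ 0, & |h| > q. \end{cases}$$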
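The two formulas translate directly into code. The following sketch computes the autocovariance and autocorrelation function of an MA($q$) process from its coefficient list; the function names and the MA(2) coefficients used in the example are illustrative assumptions.

```python
def maq_acvf(theta, sigma2, h):
    """gamma_X(h) for X_t = sum_{i=0}^{q} theta[i] * Z_{t-i} with theta[0] = 1."""
    q, h = len(theta) - 1, abs(h)
    if h > q:
        return 0.0                     # no common shocks beyond lag q
    return sigma2 * sum(theta[i] * theta[i + h] for i in range(q - h + 1))


def maq_acf(theta, h):
    """rho_X(h); sigma^2 cancels in the ratio."""
    return maq_acvf(theta, 1.0, h) / maq_acvf(theta, 1.0, 0)


theta = [1.0, 0.5, 0.25]               # hypothetical MA(2) coefficients
for h in range(-3, 4):
    print(h, maq_acvf(theta, 1.0, h), round(maq_acf(theta, h), 4))
```

For $q = 1$ the function reduces to Eq. (1.1), and for $|h| > q$ it returns zero, which reflects the short memory of moving-average processes.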


Every MA($q$) process is therefore stationary irrespective of its parameters
$\theta_0, \theta_1, \ldots, \theta_q$. Because the correlation between $X_t$ and $X_s$ is equal to zero if the
two time points $t$ and $s$ are more than $q$ periods apart, such processes are sometimes
called processes with short memory or processes with short range dependence.
Figure 2.1 displays an MA(1) process and its autocorrelation function.

2.2.2 The First Order Autoregressive Process (AR(1) Process)

The AR($p$) process requires a more thorough analysis as will already become
clear from the AR(1) process. This process is defined by the following stochastic
difference equation:

$$X_t = \phi X_{t-1} + Z_t, \quad Z_t \sim \mathrm{WN}(0, \sigma^2) \text{ and } \phi \neq 0. \tag{2.2}$$

The above stochastic difference equation has in general several solutions. Given a
sequence $\{Z_t\}$ and an arbitrary distribution for $X_0$, it determines all random variables
$X_t$, $t \in \mathbb{Z} \setminus \{0\}$, by applying the above recursion. The solutions are, however, not
necessarily stationary. But, according to Definition 2.1, only stationary processes
qualify as ARMA processes. As we will demonstrate, depending on the value of $\phi$,
there may exist no or just one solution.

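Consider first the case of $|\phi| < 1$. Inserting into the difference equation several
times leads to:

$$X_t = \phi X_{t-1} + Z_t = \phi^2 X_{t-2} + \phi Z_{t-1} + Z_t$$

$$= Z_t + \phi Z_{t-1} + \phi^2 Z_{t-2} + \ldots + \phi^k Z_{t-k} + \phi^{k+1}X_{t-k-1}.$$

If $\{X_t\}$ is a stationary solution, $\mathbb{V}X_{t-k-1}$ remains constant independently of $k$. Thus

$$\mathbb{V}\left(X_t - \sum_{j=0}^{k} \phi^j Z_{t-j}\right) = \phi^{2k+2}\,\mathbb{V}X_{t-k-1} \to 0 \quad \text{for } k \to \infty.$$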


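This shows that $\sum_{j=0}^{k} \phi^j Z_{t-j}$ converges in the mean square sense, and thus also in
probability, to $X_t$ for $k \to \infty$ (see Theorem C.8 in Appendix C). This suggests
taking

$$X_t = Z_t + \phi Z_{t-1} + \phi^2 Z_{t-2} + \ldots = \sum_{j=0}^{\infty} \phi^j Z_{t-j} \tag{2.3}$$

as the solution to the stochastic difference equation. As $\sum_{j=0}^{\infty} |\phi^j| = \frac{1}{1 - |\phi|} < \infty$ this
solution is well-defined according to Theorem 6.4 and has the following properties:

$$\mathbb{E}X_t = \sum_{j=0}^{\infty} \phi^j\,\mathbb{E}Z_{t-j} = 0,$$

$$\gamma_X(h) = \mathrm{cov}(X_{t+h}, X_t) = \lim_{k \to \infty} \mathbb{E}\left[\left(\sum_{j=0}^{k} \phi^j Z_{t+h-j}\right)\left(\sum_{j=0}^{k} \phi^j Z_{t-j}\right)\right]$$

$$= \sigma^2\phi^{|h|} \sum_{j=0}^{\infty} \phi^{2j} = \frac{\phi^{|h|}}{1 - \phi^2}\,\sigma^2, \quad h \in \mathbb{Z},$$

$$\rho_X(h) = \phi^{|h|}.$$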


Thus the solution $X_t = \sum_{j=0}^{\infty} \phi^j Z_{t-j}$ is stationary and fulfills the difference equation
as can be easily verified. It is also the only stationary solution which is compatible
with the difference equation. Assume that there is a second solution $\{\tilde{X}_t\}$ with these
properties. Inserting into the difference equation yields again

$$\mathbb{V}\left(\tilde{X}_t - \sum_{j=0}^{k} \phi^j Z_{t-j}\right) = \phi^{2k+2}\,\mathbb{V}\tilde{X}_{t-k-1}.$$

This variance converges to zero for $k$ going to infinity because $|\phi| < 1$ and because
$\{\tilde{X}_t\}$ is stationary. The two processes $\{\tilde{X}_t\}$ and $\{X_t\}$ with $X_t = \sum_{j=0}^{\infty} \phi^j Z_{t-j}$ are
therefore identical in the mean square sense and thus with probability one.

Finally, note that the recursion (2.2) will only generate a stationary process if it
is initialized with $X_0$ having the stationary distribution, i.e. if $\mathbb{E}X_0 = 0$ and $\mathbb{V}X_0 =
\sigma^2/(1 - \phi^2)$. If the recursion is initiated with an arbitrary variance of $X_0$, $0 < \sigma_0^2 <
\infty$, Eq. (2.2) implies the following difference equation for the variance of $X_t$, $\sigma_t^2$:

$$\sigma_t^2 = \phi^2\sigma_{t-1}^2 + \sigma^2.$$

The solution of this difference equation is

$$\sigma_t^2 - \bar{\sigma}^2 = (\sigma_0^2 - \bar{\sigma}^2)(\phi^2)^t$$

where $\bar{\sigma}^2 = \sigma^2/(1 - \phi^2)$ denotes the variance of the stationary distribution. If $\sigma_0^2 \neq
\bar{\sigma}^2$, $\sigma_t^2$ is not constant implying that the process $\{X_t\}$ is not stationary. However,
as $|\phi| < 1$, the variance of $X_t$, $\sigma_t^2$, will converge to the variance of the stationary
distribution.$^2$
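The convergence of the variance can be traced by iterating the difference equation. The sketch below uses the illustrative values $\phi = 0.8$, $\sigma^2 = 1$ and $\sigma_0^2 = 10$ (all assumptions made for the example) and compares the iterated variances with the closed form solution.

```python
phi, sigma2 = 0.8, 1.0
sigma2_bar = sigma2 / (1 - phi ** 2)      # variance of the stationary distribution


def variance_path(sigma2_0, n):
    """Iterate sigma_t^2 = phi^2 * sigma_{t-1}^2 + sigma^2 for t = 1, ..., n."""
    path = [sigma2_0]
    for _ in range(n):
        path.append(phi ** 2 * path[-1] + sigma2)
    return path


# Start away from the stationary variance and compare with the closed form.
path = variance_path(sigma2_0=10.0, n=40)
closed = [sigma2_bar + (10.0 - sigma2_bar) * (phi ** 2) ** t for t in range(41)]

for t in (0, 1, 5, 20, 40):
    print(t, round(path[t], 6), round(closed[t], 6))
print("stationary variance:", round(sigma2_bar, 6))
```

Starting the recursion at $\sigma_0^2 = \bar{\sigma}^2$ instead leaves the variance unchanged in every period, which corresponds to the stationary case.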

Figure  2.2  shows  a  realization  of  such  a  process  and  its  estimated  autocorrelation 
function. 

Fig. 2.2 Realization and estimated ACF of an AR(1) process: $X_t = 0.8X_{t-1} + Z_t$ with $Z_t \sim
\mathrm{IIN}(0, 1)$

$^2$ Phillips and Sul (2007) provide an application and an in-depth discussion of the hypothesis of
economic growth convergence.

In the case $|\phi| > 1$ the solution (2.3) does not converge. It is, however, possible
to iterate the difference equation forward in time to obtain:

$$X_t = \phi^{-1}X_{t+1} - \phi^{-1}Z_{t+1}$$

$$= \phi^{-k-1}X_{t+k+1} - \phi^{-1}Z_{t+1} - \phi^{-2}Z_{t+2} - \ldots - \phi^{-k-1}Z_{t+k+1}.$$

This suggests taking

$$X_t = -\sum_{j=1}^{\infty} \phi^{-j}Z_{t+j}$$

as the solution. Going through similar arguments as before it is possible to show that
this is indeed the only stationary solution. This solution is, however, viewed to be
inadequate because $X_t$ depends on future shocks $Z_{t+j}$, $j = 1, 2, \ldots$ Note, however,
that there exists an AR(1) process with $|\phi| < 1$ which is observationally equivalent,
in the sense that it generates the same autocorrelation function, but with a new shock
or forcing variable $\{\tilde{Z}_t\}$ (see next section).

In the case $|\phi| = 1$ there exists no stationary solution (see Sect. 1.4.4) and
therefore, according to our definition, no ARMA process. Processes with this
property are called random walks, unit root processes or integrated processes. They
play an important role in economics and are treated separately in Chap. 7.

2.3 Causality and Invertibility

If we interpret $\{X_t\}$ as the state variable and $\{Z_t\}$ as an impulse or shock, we can
ask whether it is possible to represent today's state $X_t$ as the outcome of current
and past shocks $Z_t, Z_{t-1}, Z_{t-2}, \ldots$ In this case we can view $X_t$ as being caused by
past shocks and call this a causal representation. Thus, shocks to current $Z_t$ will not
only influence current $X_t$, but will propagate to affect also future $X_t$'s. This notion of
causality rests on the assumption that the past can cause the future but that the future
cannot cause the past. See Sect. 15.1 for an elaboration of the concept of causality
and its generalization to the multivariate context.

In the case that $\{X_t\}$ is a moving-average process of order $q$, $X_t$ is given as
a weighted sum of current and past shocks $Z_t, Z_{t-1}, \ldots, Z_{t-q}$. Thus, the moving-
average representation is already the causal representation. In the case of an
AR(1) process, we have seen that this is not always feasible. For $|\phi| < 1$, the
solution (2.3) represents $X_t$ as a weighted sum of current and past shocks and is
thus the corresponding causal representation. For $|\phi| > 1$, no such representation is
possible. The following Definition 2.2 makes the notion of a causal representation
precise and Theorem 2.1 gives a general condition for its existence.

Definition 2.2 (Causality). An ARMA(p, q) process $\{X_t\}$ with $\Phi(L)X_t = \Theta(L)Z_t$ is
called causal with respect to $\{Z_t\}$ if there exists a sequence $\{\psi_j\}$ with the property
$\sum_{j=0}^{\infty} |\psi_j| < \infty$ such that

$$X_t = Z_t + \psi_1 Z_{t-1} + \psi_2 Z_{t-2} + \ldots = \sum_{j=0}^{\infty} \psi_j Z_{t-j} = \Psi(L) Z_t \quad \text{with } \psi_0 = 1,$$

where $\Psi(L) = 1 + \psi_1 L + \psi_2 L^2 + \ldots = \sum_{j=0}^{\infty} \psi_j L^j$. The above equation is referred
to as the causal representation of $\{X_t\}$ with respect to $\{Z_t\}$.

The coefficients $\{\psi_j\}$ are of great importance because they determine how an
impulse or a shock in period $t$ propagates to affect current and future $X_{t+j}$,
$j = 0, 1, 2, \ldots$ In particular, consider an impulse $e_{t_0}$ at time $t_0$, i.e. a time series which is
equal to zero except for the time $t_0$ where it takes on the value $e_{t_0}$. Then $\{\psi_{t-t_0} e_{t_0}\}$
traces out the time history of this impulse. For this reason, the coefficients $\psi_j$ with
$j = t - t_0$, $t = t_0, t_0 + 1, t_0 + 2, \ldots$, are called the impulse response function.
If $e_{t_0} = 1$, it is called a unit impulse. Alternatively, $e_{t_0}$ is sometimes taken to be
equal to $\sigma$, the standard deviation of $Z_t$. It is customary to plot $\psi_j$ as a function of $j$,
$j = 0, 1, 2, \ldots$

Note that the notion of causality is not an attribute of $\{X_t\}$, but is defined relative
to another process $\{Z_t\}$. It is therefore possible that a stationary process is causal
with respect to one process, but not with respect to another process. In order to make
this point more concrete, consider again the AR(1) process defined by the equation
$X_t = \phi X_{t-1} + Z_t$ with $|\phi| > 1$. As we have seen, the only stationary solution is given
by $X_t = -\sum_{j=1}^{\infty} \phi^{-j} Z_{t+j}$, which is clearly not causal with respect to $\{Z_t\}$. Consider as
an alternative the process

$$\tilde{Z}_t = X_t - \frac{1}{\phi} X_{t-1} = \phi^{-2} Z_t + (\phi^{-2} - 1) \sum_{j=1}^{\infty} \phi^{-j} Z_{t+j}. \qquad (2.4)$$

This new process is white noise with variance $\tilde{\sigma}^2 = \phi^{-2}\sigma^2$.³ Because $\{X_t\}$ fulfills
the difference equation

$$X_t = \frac{1}{\phi} X_{t-1} + \tilde{Z}_t,$$

$\{X_t\}$ is causal with respect to $\{\tilde{Z}_t\}$. This remark shows that there is no loss of
generality involved if we concentrate on causal ARMA processes.

Theorem 2.1. Let $\{X_t\}$ be an ARMA(p, q) process with $\Phi(L)X_t = \Theta(L)Z_t$ and
assume that the polynomials $\Phi(z)$ and $\Theta(z)$ have no common root. $\{X_t\}$ is causal
with respect to $\{Z_t\}$ if and only if $\Phi(z) \neq 0$ for $|z| \le 1$, i.e. all roots of the equation
$\Phi(z) = 0$ are outside the unit circle. The coefficients $\{\psi_j\}$ are then uniquely defined
by the identity:

$$\Psi(z) = \sum_{j=0}^{\infty} \psi_j z^j = \frac{\Theta(z)}{\Phi(z)}.$$


Proof. Given that $\Phi(z)$ is a finite order polynomial with $\Phi(z) \neq 0$ for $|z| \le 1$, there
exists $\epsilon > 0$ such that $\Phi(z) \neq 0$ for $|z| \le 1 + \epsilon$. This implies that $1/\Phi(z)$ is an
analytic function on the circle with radius $1 + \epsilon$ and therefore possesses a power
series expansion:

$$\frac{1}{\Phi(z)} = \sum_{j=0}^{\infty} \xi_j z^j = \Xi(z) \quad \text{for } |z| < 1 + \epsilon.$$

This implies that $\xi_j (1 + \epsilon/2)^j$ goes to zero as $j$ goes to infinity. Thus there exists a positive
and finite constant $C$ such that

$$|\xi_j| < C (1 + \epsilon/2)^{-j} \quad \text{for all } j = 0, 1, 2, \ldots$$

This in turn implies that $\sum_{j=0}^{\infty} |\xi_j| < \infty$ and that $\Xi(z)\Phi(z) = 1$ for $|z| \le 1$.
Applying $\Xi(L)$ on both sides of $\Phi(L)X_t = \Theta(L)Z_t$ gives:

$$X_t = \Xi(L)\Phi(L)X_t = \Xi(L)\Theta(L)Z_t.$$


³ The reader is invited to verify this.

Theorem 6.4 implies that the right hand side is well-defined. Thus $\Psi(L) =
\Xi(L)\Theta(L)$ is the sought polynomial. Its coefficients are determined by the relation
$\Psi(z) = \Theta(z)/\Phi(z)$.

Assume now that there exists a causal representation $X_t = \sum_{j=0}^{\infty} \psi_j Z_{t-j}$ with
$\sum_{j=0}^{\infty} |\psi_j| < \infty$. Therefore

$$\Theta(L)Z_t = \Phi(L)X_t = \Phi(L)\Psi(L)Z_t.$$

Take $\eta(z) = \Phi(z)\Psi(z) = \sum_{j=0}^{\infty} \eta_j z^j$, $|z| \le 1$. Multiplying the above equation by
$Z_{t-k}$ and taking expectations shows that $\eta_k = \theta_k$, $k = 0, 1, 2, \ldots, q$, and that $\eta_k = 0$
for $k > q$. Thus we get $\Theta(z) = \eta(z) = \Phi(z)\Psi(z)$ for $|z| \le 1$. As $\Theta(z)$ and $\Phi(z)$
have no common roots and because $|\Psi(z)| < \infty$ for $|z| \le 1$, $\Phi(z)$ cannot be equal
to zero for $|z| \le 1$. □

Remark  2.1.  If  the  AR  and  the  MA  polynomial  have  common  roots,  there  are  two 
possibilities: 

•  No common root lies on the unit circle. In this situation there exists a unique
stationary  solution  which  can  be  obtained  by  canceling  the  common  factors  of 
the  polynomials. 

•  If  at  least  one  common  root  lies  on  the  unit  circle  then  more  than  one  stationary 
solution  may  exist  (see  the  last  example  below). 


Some  Examples 

We illustrate the above Theorem and Remark by investigating some examples
starting from the ARMA model $\Phi(L)X_t = \Theta(L)Z_t$ with $Z_t \sim \mathrm{WN}(0, \sigma^2)$.

$\Phi(L) = 1 - 0.05L - 0.6L^2$ and $\Theta(L) = 1$: The roots of the polynomial $\Phi(z)$ are
$z_1 = -4/3$ and $z_2 = 5/4$. Because both roots are greater than one in absolute value,
there exists a causal representation with respect to $\{Z_t\}$.

$\Phi(L) = 1 + 2L + \frac{5}{4}L^2$ and $\Theta(L) = 1$: In this case the roots are complex
conjugates and equal to $z_1 = -4/5 + (2/5)i$ and $z_2 = -4/5 - (2/5)i$. The modulus
or absolute value of $z_1$ and $z_2$ equals $|z_1| = |z_2| = \sqrt{20/25}$. This number is
smaller than one. Therefore there exists a stationary solution, but this solution is
not causal with respect to $\{Z_t\}$.

$\Phi(L) = 1 - 0.05L - 0.6L^2$ and $\Theta(L) = 1 + 0.75L$: $\Phi(z)$ and $\Theta(z)$ have the
common root $z = -4/3$ with $|z| \neq 1$. Thus one can cancel the factor $1 + \frac{3}{4}L$ from both
$\Phi(L)$ and $\Theta(L)$ to obtain the polynomials $\tilde{\Phi}(L) = 1 - 0.8L$ and $\tilde{\Theta}(L) = 1$. Because the
root of $\tilde{\Phi}(z)$ equals $5/4$, which is greater than one, there exists a unique stationary
and causal representation with respect to $\{Z_t\}$.

$\Phi(L) = 1 + 1.2L - 1.6L^2$ and $\Theta(L) = 1 + 2L$: The roots of $\Phi(z)$ are $z_1 = 5/4$
and $z_2 = -0.5$. Thus one root is outside the unit circle whereas one is inside.
This would suggest that there is no causal solution. However, the root $-0.5$, which does
not lie on the unit circle, is shared by $\Phi(z)$ and $\Theta(z)$ and can therefore be canceled to
obtain $\tilde{\Phi}(L) = 1 - 0.8L$ and $\tilde{\Theta}(L) = 1$. Because the root of $\tilde{\Phi}(z)$ equals $5/4 > 1$,
there exists a unique stationary and causal solution with respect to $\{Z_t\}$.

$\Phi(L) = 1 + L$ and $\Theta(L) = 1 + L$: $\Phi(z)$ and $\Theta(z)$ have the common root $-1$
which lies on the unit circle. As before one might cancel the factor $1 + L$ from both
polynomials to obtain the trivial stationary and causal solution $\{X_t\} = \{Z_t\}$. This
is, however, not the only solution. Additional solutions are given by $\{Y_t\} =
\{Z_t + A(-1)^t\}$ where $A$ is an arbitrary random variable with mean zero and finite
variance $\sigma_A^2$ which is independent from both $\{X_t\}$ and $\{Z_t\}$. The process $\{Y_t\}$ has
a mean of zero and an autocovariance function $\gamma_Y(h)$ which is equal to

$$\gamma_Y(h) = \begin{cases} \sigma^2 + \sigma_A^2, & h = 0; \\ (-1)^h \sigma_A^2, & h = \pm 1, \pm 2, \ldots \end{cases}$$

This new process is therefore stationary and fulfills the difference equation.
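
The root computations in the first two examples can be reproduced numerically. The following is a minimal sketch in Python, restricted to second-order AR polynomials and assuming that the AR and MA polynomials share no common root, so that the criterion of Theorem 2.1 applies directly. The function names `ar2_roots` and `is_causal` are illustrative and not part of the text.

```python
import cmath


def ar2_roots(phi1, phi2):
    """Roots of Phi(z) = 1 - phi1*z - phi2*z**2 (assumes phi2 != 0)."""
    # Phi(z) = 0 is equivalent to phi2*z**2 + phi1*z - 1 = 0.
    disc = cmath.sqrt(phi1 ** 2 + 4 * phi2)
    return ((-phi1 + disc) / (2 * phi2), (-phi1 - disc) / (2 * phi2))


def is_causal(roots):
    """Criterion of Theorem 2.1: every root lies strictly outside the unit circle."""
    return all(abs(z) > 1 for z in roots)


# First example: Phi(L) = 1 - 0.05L - 0.6L^2, i.e. phi1 = 0.05, phi2 = 0.6
roots_1 = ar2_roots(0.05, 0.6)
# Second example: Phi(L) = 1 + 2L + (5/4)L^2, i.e. phi1 = -2, phi2 = -1.25
roots_2 = ar2_roots(-2.0, -1.25)

print(roots_1, is_causal(roots_1))
print(roots_2, [abs(z) for z in roots_2], is_causal(roots_2))
```

When common roots are present, as in the remaining examples, the common factors have to be canceled before such a check is meaningful.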


Remark 2.2. If the AR and the MA polynomial in the stochastic difference equation
$\Phi(L)X_t = \Theta(L)Z_t$ have no common root, but $\Phi(z) = 0$ for some $z$ on the unit circle,
there exists no stationary solution. In this sense the stochastic difference equation
no longer defines an ARMA model. Models with this property are said to have
a unit root and are treated in Chap. 7. If $\Phi(z)$ has no root on the unit circle, there
exists a unique stationary solution.

As explained in the previous Theorem, the coefficients $\{\psi_j\}$ of the causal
representation are uniquely determined by the relation $\Psi(z)\Phi(z) = \Theta(z)$. If $\{X_t\}$ is
a MA process, $\Phi(z) = 1$ and the coefficients $\{\psi_j\}$ just correspond to the coefficients
of the MA polynomial, i.e. $\psi_j = \theta_j$ for $0 \le j \le q$ and $\psi_j = 0$ for $j > q$. Thus
in this case no additional computations are necessary. In general this is not the
case. In principle there are two ways to find the coefficients $\{\psi_j\}$. The first one
uses polynomial division or partial fractions, the second one uses the method of
undetermined coefficients. This book relies on the second method because it is more
intuitive and offers some additional insights. For this purpose let us write out the
defining relation $\Psi(z)\Phi(z) = \Theta(z)$:


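$$(\psi_0 + \psi_1 z + \psi_2 z^2 + \ldots)(1 - \phi_1 z - \phi_2 z^2 - \ldots - \phi_p z^p) = 1 + \theta_1 z + \theta_2 z^2 + \ldots + \theta_q z^q$$

Multiplying out the left hand side one gets:

$$\begin{aligned}
&\psi_0 - \psi_0\phi_1 z - \psi_0\phi_2 z^2 - \psi_0\phi_3 z^3 - \ldots - \psi_0\phi_p z^p \\
&\quad + \psi_1 z - \psi_1\phi_1 z^2 - \psi_1\phi_2 z^3 - \ldots - \psi_1\phi_p z^{p+1} \\
&\quad + \psi_2 z^2 - \psi_2\phi_1 z^3 - \ldots - \psi_2\phi_p z^{p+2} \\
&\quad + \ldots \\
&= 1 + \theta_1 z + \theta_2 z^2 + \theta_3 z^3 + \ldots + \theta_q z^q
\end{aligned}$$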

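Equating the coefficients of the powers of $z$, $z^j$, $j = 0, 1, 2, \ldots$, one obtains the
following equations:

$$\begin{aligned}
z^0 &: \quad \psi_0 = 1, \\
z^1 &: \quad \psi_1 = \theta_1 + \phi_1\psi_0 = \theta_1 + \phi_1, \\
z^2 &: \quad \psi_2 = \theta_2 + \phi_2\psi_0 + \phi_1\psi_1 = \theta_2 + \phi_2 + \phi_1\theta_1 + \phi_1^2, \\
&\;\;\vdots
\end{aligned}$$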

As can be seen, it is possible to solve recursively for the unknown coefficients $\{\psi_j\}$.
This is convenient when it comes to numerical computations, but in some cases one
wants an analytical solution. Such a solution can be obtained by observing that, for
$j \ge \max\{p, q + 1\}$, the recursion leads to the following difference equation of order
$p$:

$$\psi_j = \sum_{k=1}^{p} \phi_k \psi_{j-k} = \phi_1\psi_{j-1} + \phi_2\psi_{j-2} + \ldots + \phi_p\psi_{j-p}, \quad j \ge \max\{p, q + 1\}.$$


This is a linear homogeneous difference equation with constant coefficients. The
solution of such an equation is of the form (see Eq. (B.1) in Appendix B):

$$\psi_j = c_1 z_1^{-j} + \ldots + c_p z_p^{-j}, \quad j \ge \max\{p, q + 1\} - p, \qquad (2.5)$$

where $z_1, \ldots, z_p$ denote the roots of $\Phi(z) = 1 - \phi_1 z - \ldots - \phi_p z^p = 0$.⁴ Note
that the roots are exactly those which have been computed to assess the existence
of a causal representation. The coefficients $c_1, \ldots, c_p$ can be obtained using the $p$
boundary conditions obtained from $\psi_j - \sum_{k=1}^{p} \phi_k \psi_{j-k} = \theta_j$, $\max\{p, q + 1\} - p \le
j < \max\{p, q + 1\}$. Finally, the values for $\psi_j$, $0 \le j < \max\{p, q + 1\} - p$, must be
computed from the first $\max\{p, q + 1\} - p$ iterations (see the example in Sect. 2.4).
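
The recursive solution for the coefficients $\{\psi_j\}$ translates almost literally into code. The sketch below is one possible Python rendering of the method of undetermined coefficients; the function name `psi_weights` and its argument layout are assumptions made for illustration. The coefficients in the usage line are those of the ARMA(2,1) example of Sect. 2.4.

```python
def psi_weights(phi, theta, n):
    """Return psi_0, ..., psi_{n-1} solving Psi(z) * Phi(z) = Theta(z).

    phi   = [phi_1, ..., phi_p]     with Phi(z)   = 1 - phi_1 z - ... - phi_p z^p
    theta = [theta_1, ..., theta_q] with Theta(z) = 1 + theta_1 z + ... + theta_q z^q
    """
    psi = []
    for j in range(n):
        # Start from theta_j, where theta_0 = 1 and theta_j = 0 for j > q.
        if j == 0:
            value = 1.0
        elif j <= len(theta):
            value = theta[j - 1]
        else:
            value = 0.0
        # Add phi_1 * psi_{j-1} + ... + phi_k * psi_{j-k} for k up to min(j, p).
        for k in range(1, min(j, len(phi)) + 1):
            value += phi[k - 1] * psi[j - k]
        psi.append(value)
    return psi


# ARMA(2,1) with Phi(L) = 1 - 1.3L + 0.4L^2 and Theta(L) = 1 + 0.4L
psi = psi_weights([1.3, -0.4], [0.4], 8)
print(psi)
```

For a pure MA process (empty `phi`) the function simply returns the MA coefficients followed by zeros, in line with the remark that no computation is needed in that case.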

As mentioned previously, the coefficients $\{\psi_j\}$ are of great importance as
they quantify the effect of a shock to $Z_{t-j}$ on $X_t$, respectively of $Z_t$ on $X_{t+j}$. In
macroeconomics they are sometimes called dynamic multipliers of a transitory or
temporary shock. Because the underlying ARMA process is stationary and causal,
the infinite sum $\sum_{j=0}^{\infty} |\psi_j|$ converges. This implies that the effect $\psi_j$ converges to
zero as $j \to \infty$. Thus the effect of a shock dies out eventually⁵:

$$\frac{\partial X_{t+j}}{\partial Z_t} = \psi_j \to 0 \quad \text{for } j \to \infty.$$

⁴ In the case of multiple roots one has to modify the formula according to Eq. (B.2).


As can be seen from Eq. (2.5), the coefficients $\{\psi_j\}$ even converge to zero
exponentially fast because each term $c_i z_i^{-j}$, $i = 1, \ldots, p$, goes to zero exponentially
fast as the roots $z_i$ are greater than one in absolute value. Viewing $\{\psi_j\}$ as a function
of $j$ one gets the so-called impulse response function which is usually displayed
graphically.

The effect of a permanent shock in period $t$ on $X_{t+j}$ is defined as the cumulative
effect of a transitory shock. Thus, the effect of a permanent shock to $X_{t+j}$ is given by
$\sum_{i=0}^{j} \psi_i$. Because $\sum_{i=0}^{j} \psi_i \le \sum_{i=0}^{j} |\psi_i| \le \sum_{i=0}^{\infty} |\psi_i| < \infty$, the cumulative effect
remains finite.
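
To see the difference between a transitory and a permanent shock in numbers, the following Python sketch accumulates impulse responses. It assumes, purely for illustration, a causal AR(1) process with the hypothetical coefficient $\phi = 0.8$, for which $\psi_j = \phi^j$ and the cumulative effect approaches $1/(1 - \phi)$.

```python
def cumulative_responses(psi):
    """Running sums psi_0 + ... + psi_j: the responses to a permanent unit shock."""
    total = 0.0
    out = []
    for value in psi:
        total += value
        out.append(total)
    return out


phi = 0.8  # hypothetical AR(1) coefficient with |phi| < 1
psi = [phi ** j for j in range(200)]  # impulse responses of the causal AR(1)
cumulative = cumulative_responses(psi)
long_run = 1.0 / (1.0 - phi)  # limit of the cumulative effect for this AR(1)

print(cumulative[:5], cumulative[-1], long_run)
```

The transitory responses shrink geometrically while the cumulative responses rise towards the finite limit, which is the behaviour described above.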

In time series analysis we view the observations as realizations of $\{X_t\}$ and treat
the realizations of $\{Z_t\}$ as unobserved. It is therefore of interest to know whether it is
possible to recover the unobserved shocks from the observations on $\{X_t\}$. This idea
leads to the concept of invertibility.

Definition 2.3 (Invertibility). An ARMA(p, q) process $\{X_t\}$ satisfying $\Phi(L)X_t
= \Theta(L)Z_t$ is called invertible with respect to $\{Z_t\}$ if and only if there exists a
sequence $\{\pi_j\}$ with the property $\sum_{j=0}^{\infty} |\pi_j| < \infty$ such that

$$Z_t = \sum_{j=0}^{\infty} \pi_j X_{t-j}.$$

Note that like causality, invertibility is not an attribute of $\{X_t\}$, but is defined only
relative to another process $\{Z_t\}$. In the literature, one often refers to invertibility as
the strict miniphase property.⁶

Theorem 2.2. Let $\{X_t\}$ be an ARMA(p, q) process with $\Phi(L)X_t = \Theta(L)Z_t$ such
that the polynomials $\Phi(z)$ and $\Theta(z)$ have no common roots. Then $\{X_t\}$ is invertible with
respect to $\{Z_t\}$ if and only if $\Theta(z) \neq 0$ for $|z| \le 1$. The coefficients $\{\pi_j\}$ are then
uniquely determined through the relation:

$$\Pi(z) = \sum_{j=0}^{\infty} \pi_j z^j = \frac{\Phi(z)}{\Theta(z)}.$$

⁵ The use of the partial derivative sign actually represents an abuse of notation. It is inspired by an
alternative definition of the impulse responses: $\psi_j = \partial \tilde{P}_t X_{t+j} / \partial x_t$ where $\tilde{P}_t$ denotes the optimal (in the
mean squared error sense) linear predictor of $X_{t+j}$ given a realization back to the infinitely remote past
$\{x_t, x_{t-1}, x_{t-2}, \ldots\}$ (see Sect. 3.1.3). Thus, $\psi_j$ represents the sensitivity of the forecast of $X_{t+j}$ with
respect to the observation $x_t$. The equivalence of alternative definitions in the linear and especially
nonlinear context is discussed in Potter (2000).

⁶ Without the qualification strict, the miniphase property allows for roots of $\Theta(z)$ on the unit circle.
The terminology is, however, not uniform in the literature.


Proof. The proof follows from Theorem 2.1 with $X_t$ and $Z_t$ interchanged. □


The discussion in Sect. 1.3 showed that there are in general two MA(1) processes
compatible with the same autocorrelation function $\rho(h)$ given by $\rho(0) = 1$, $\rho(1) =
\rho$ with $|\rho| \le \frac{1}{2}$ and $\rho(h) = 0$ for $h \ge 2$. However, only one of these solutions
is invertible because the two solutions for $\theta$ are inverses of each other. As it is
important to be able to recover $Z_t$ from current and past $X_t$, one prefers the invertible
solution. Section 3.2 further elucidates this issue.
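
A short numerical check makes the identification issue tangible. The Python sketch below uses the standard first-order autocorrelation of an MA(1) process $X_t = Z_t + \theta Z_{t-1}$, namely $\rho(1) = \theta/(1 + \theta^2)$, and a hypothetical value $\theta = 0.5$; the function names are illustrative.

```python
def ma1_rho1(theta):
    """First-order autocorrelation of the MA(1) process X_t = Z_t + theta * Z_{t-1}."""
    return theta / (1.0 + theta ** 2)


def ma1_is_invertible(theta):
    """Theta(z) = 1 + theta*z has the root -1/theta, outside the unit circle iff |theta| < 1."""
    return abs(theta) < 1.0


theta = 0.5            # hypothetical MA coefficient
theta_alt = 1 / theta  # the second solution, the inverse of the first

print(ma1_rho1(theta), ma1_rho1(theta_alt))
print(ma1_is_invertible(theta), ma1_is_invertible(theta_alt))
```

Both parameter values imply the same autocorrelation function, but only the one with $|\theta| < 1$ satisfies the condition of Theorem 2.2.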


Remark 2.3. If $\{X_t\}$ is a stationary solution to the stochastic difference equation
$\Phi(L)X_t = \Theta(L)Z_t$ with $Z_t \sim \mathrm{WN}(0, \sigma^2)$ and if $\Phi(z)\Theta(z) \neq 0$ for $|z| \le 1$ then

$$X_t = \sum_{j=0}^{\infty} \psi_j Z_{t-j} \quad \text{and} \quad Z_t = \sum_{j=0}^{\infty} \pi_j X_{t-j},$$

where the coefficients $\{\psi_j\}$ and $\{\pi_j\}$ are determined for $|z| \le 1$ by $\Psi(z) = \frac{\Theta(z)}{\Phi(z)}$ and
$\Pi(z) = \frac{\Phi(z)}{\Theta(z)}$, respectively. In this case $\{X_t\}$ is causal and invertible with respect
to $\{Z_t\}$.


Remark 2.4. If $\{X_t\}$ is an ARMA process with $\Phi(L)X_t = \Theta(L)Z_t$ such that
$\Phi(z) \neq 0$ for $|z| = 1$ then there exist polynomials $\tilde{\Phi}(z)$ and $\tilde{\Theta}(z)$ and a white
noise process $\{\tilde{Z}_t\}$ such that $\{X_t\}$ fulfills the stochastic difference equation $\tilde{\Phi}(L)X_t =
\tilde{\Theta}(L)\tilde{Z}_t$ and is causal with respect to $\{\tilde{Z}_t\}$. If in addition $\Theta(z) \neq 0$ for $|z| = 1$ then
$\tilde{\Theta}(L)$ can be chosen such that $\{X_t\}$ is also invertible with respect to $\{\tilde{Z}_t\}$ (see the
discussion of the AR(1) process after the definition of causality and Brockwell and
Davis (1991, p. 88)). Thus, without loss of generality, we can restrict the analysis to
causal and invertible ARMA processes.


2.4 Computation of the Autocovariance Function of an ARMA Process

Whereas the autocovariance function summarizes the external and directly observable
properties of a time series, the coefficients of the ARMA process give
information on its internal structure. Although there exists for each ARMA model
a corresponding autocovariance function, the converse is not true as we have seen
in Sect. 1.3 where we showed that two MA(1) processes are compatible with the
same autocovariance function. This brings up a fundamental identification problem.
In order to shed some light on the relation between autocovariance function and
ARMA models it is necessary to be able to compute the autocovariance function for
a given ARMA model. In the following, we will discuss three such procedures.
Each procedure relies on the assumption that the ARMA process $\Phi(L)X_t =
\Theta(L)Z_t$ with $Z_t \sim \mathrm{WN}(0, \sigma^2)$ is causal with respect to $\{Z_t\}$. Thus there exists a
representation of $X_t$ as a weighted sum of current and past $Z_t$'s: $X_t = \sum_{j=0}^{\infty} \psi_j Z_{t-j}$
with $\sum_{j=0}^{\infty} |\psi_j| < \infty$.

2.4.1  First  Procedure 

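Starting from the causal representation of $\{X_t\}$, it is easy to calculate its
autocovariance function given that $\{Z_t\}$ is white noise. The exact formula is proved in
Theorem 6.4:

$$\gamma(h) = \sigma^2 \sum_{j=0}^{\infty} \psi_j \psi_{j+|h|},$$

where

$$\Psi(z) = \sum_{j=0}^{\infty} \psi_j z^j = \frac{\Theta(z)}{\Phi(z)} \quad \text{for } |z| \le 1.$$

The first step consists in determining the coefficients $\psi_j$ by the method of
undetermined coefficients. This leads to the following system of equations:

$$\begin{aligned}
\psi_j - \sum_{0 < k \le j} \phi_k \psi_{j-k} &= \theta_j, & 0 \le j < \max\{p, q + 1\}, \\
\psi_j - \sum_{0 < k \le p} \phi_k \psi_{j-k} &= 0, & j \ge \max\{p, q + 1\}.
\end{aligned}$$

This equation system can be solved recursively (see Sect. 2.3):

$$\begin{aligned}
\psi_0 &= \theta_0 = 1, \\
\psi_1 &= \theta_1 + \psi_0\phi_1 = \theta_1 + \phi_1, \\
\psi_2 &= \theta_2 + \psi_0\phi_2 + \psi_1\phi_1 = \theta_2 + \phi_2 + \phi_1\theta_1 + \phi_1^2, \\
&\;\;\vdots
\end{aligned}$$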

Alternatively one may view the second part of the equation system as a linear
homogeneous difference equation with constant coefficients (see Sect. 2.3). Its
solution is given by Eq. (2.5). The first part of the equation system delivers the
necessary initial conditions to determine the coefficients $c_1, c_2, \ldots, c_p$. Finally one
can insert the $\psi$'s in the above formula for the autocovariance function.

A  Numerical  Example 

Consider the ARMA(2,1) process with $\Phi(L) = 1 - 1.3L + 0.4L^2$ and $\Theta(L) =
1 + 0.4L$. Writing out the defining equation for $\Psi(z)$, $\Psi(z)\Phi(z) = \Theta(z)$, gives:

$$\begin{aligned}
&1 + \psi_1 z + \psi_2 z^2 + \psi_3 z^3 + \ldots \\
&\quad - 1.3z - 1.3\psi_1 z^2 - 1.3\psi_2 z^3 - \ldots \\
&\quad + 0.4z^2 + 0.4\psi_1 z^3 + \ldots \\
&= 1 + 0.4z.
\end{aligned}$$

Equating the coefficients of the powers of $z$ leads to the following equation system:

$$\begin{aligned}
\psi_0 &= 1, \\
\psi_1 - 1.3 &= 0.4, \\
\psi_2 - 1.3\psi_1 + 0.4 &= 0, \\
\psi_3 - 1.3\psi_2 + 0.4\psi_1 &= 0, \\
&\;\;\vdots \\
\psi_j - 1.3\psi_{j-1} + 0.4\psi_{j-2} &= 0, \quad \text{for } j \ge 2.
\end{aligned}$$

The last equation represents a linear difference equation of order two. Its solution is
given by

$$\psi_j = c_1 z_1^{-j} + c_2 z_2^{-j}, \quad j \ge \max\{p, q + 1\} - p = 0,$$

whereby $z_1$ and $z_2$ are the two distinct roots of the characteristic polynomial
$\Phi(z) = 1 - 1.3z + 0.4z^2 = 0$ (see Eq. (2.5)) and where the coefficients $c_1$ and
$c_2$ are determined from the initial conditions. The two roots are
$\frac{1.3 \pm \sqrt{1.69 - 4 \times 0.4}}{2 \times 0.4}$, i.e. $5/4 = 1.25$ and $2$. The general solution to the
homogeneous equation therefore is $\psi_j = c_1 0.8^j + c_2 0.5^j$. The constants $c_1$ and $c_2$ are
determined by the equations:

$$\begin{aligned}
j = 0: &\quad \psi_0 = 1 = c_1 0.8^0 + c_2 0.5^0 = c_1 + c_2, \\
j = 1: &\quad \psi_1 = 1.7 = c_1 0.8^1 + c_2 0.5^1 = 0.8c_1 + 0.5c_2.
\end{aligned}$$

Solving this equation system in the two unknowns $c_1$ and $c_2$ gives $c_1 = 4$ and
$c_2 = -3$. Thus the solution to the difference equation is given by:

$$\psi_j = 4(0.8)^j - 3(0.5)^j.$$

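Inserting this solution for $\psi_j$ into the above formula for $\gamma(h)$, $h \ge 0$, one obtains after using
the formula for the geometric sum:

$$\begin{aligned}
\gamma(h) &= \sigma^2 \sum_{j=0}^{\infty} \left(4 \times 0.8^j - 3 \times 0.5^j\right)\left(4 \times 0.8^{j+h} - 3 \times 0.5^{j+h}\right) \\
&= \sigma^2 \sum_{j=0}^{\infty} \left(16 \times 0.8^{2j+h} - 12 \times 0.5^j \times 0.8^{j+h} - 12 \times 0.8^j \times 0.5^{j+h} + 9 \times 0.5^{2j+h}\right) \\
&= 16\sigma^2 \frac{0.8^h}{1 - 0.64} - 12\sigma^2 \frac{0.8^h}{1 - 0.4} - 12\sigma^2 \frac{0.5^h}{1 - 0.4} + 9\sigma^2 \frac{0.5^h}{1 - 0.25} \\
&= \frac{220}{9}\sigma^2 (0.8)^h - 8\sigma^2 (0.5)^h.
\end{aligned}$$

Dividing $\gamma(h)$ by $\gamma(0)$, one gets the autocorrelation function:

$$\rho(h) = \frac{\gamma(h)}{\gamma(0)} = \frac{55}{37} \times 0.8^h - \frac{18}{37} \times 0.5^h,$$

which is represented in Fig. 2.3.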
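
The closed-form result can be cross-checked against a direct, truncated evaluation of $\gamma(h) = \sigma^2 \sum_j \psi_j \psi_{j+|h|}$. The Python sketch below does this for the example; the truncation length of 500 terms and the normalization $\sigma^2 = 1$ are arbitrary choices made for illustration.

```python
def psi(j):
    """Impulse responses of the example: psi_j = 4 * 0.8^j - 3 * 0.5^j."""
    return 4 * 0.8 ** j - 3 * 0.5 ** j


def gamma_truncated(h, n_terms=500, sigma2=1.0):
    """Approximate sigma^2 * sum_j psi_j * psi_{j+|h|} by its first n_terms terms."""
    h = abs(h)
    return sigma2 * sum(psi(j) * psi(j + h) for j in range(n_terms))


def gamma_closed(h, sigma2=1.0):
    """Closed form from the text: (220/9) * sigma^2 * 0.8^|h| - 8 * sigma^2 * 0.5^|h|."""
    h = abs(h)
    return sigma2 * (220 / 9 * 0.8 ** h - 8 * 0.5 ** h)


def rho(h):
    """Autocorrelation function obtained by dividing gamma(h) by gamma(0)."""
    return gamma_closed(h) / gamma_closed(0)


for h in range(4):
    print(h, gamma_truncated(h), gamma_closed(h), rho(h))
```

Because the summands decay geometrically, the truncated sum and the closed form agree up to floating-point error.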


2.4.2  Second  Procedure 

Instead of determining the $\psi_j$ coefficients first, it is possible to compute the
autocovariance function directly from the ARMA model. To see this multiply the
ARMA equation successively by $X_{t-h}$, $h = 0, 1, \ldots$ and apply the expectations
operator:

$$\begin{aligned}
E X_t X_{t-h} - \phi_1 E X_{t-1} X_{t-h} - \ldots - \phi_p E X_{t-p} X_{t-h} \\
= E Z_t X_{t-h} + \theta_1 E Z_{t-1} X_{t-h} + \ldots + \theta_q E Z_{t-q} X_{t-h}.
\end{aligned}$$

This leads to an equation system for the autocovariances $\gamma(h)$, $h = 0, 1, 2, \ldots$:

$$\begin{aligned}
\gamma(h) - \phi_1\gamma(h-1) - \ldots - \phi_p\gamma(h-p) &= \sigma^2 \sum_{h \le j \le q} \theta_j \psi_{j-h}, & h < \max\{p, q + 1\}, \\
\gamma(h) - \phi_1\gamma(h-1) - \ldots - \phi_p\gamma(h-p) &= 0, & h \ge \max\{p, q + 1\}.
\end{aligned}$$

Fig. 2.3 Autocorrelation function of the ARMA(2,1) process: $(1 - 1.3L + 0.4L^2)X_t = (1 + 0.4L)Z_t$

The second part of the equation system consists again of a linear homogeneous
difference equation in $\gamma(h)$ whereas the first part can be used to determine the initial
conditions. Note that the initial conditions depend on $\psi_1, \ldots, \psi_q$ which have to be
determined beforehand. The general solution of the difference equation is:

$$\gamma(h) = c_1 z_1^{-h} + \ldots + c_p z_p^{-h} \qquad (2.6)$$

where $z_1, \ldots, z_p$ are the distinct roots of the polynomial $\Phi(z) = 1 - \phi_1 z - \ldots -
\phi_p z^p = 0$.⁷ The constants $c_1, \ldots, c_p$ can be computed from the first $p$ initial
conditions after $\psi_1, \ldots, \psi_q$ have been calculated as in the first procedure. The
form of the solution shows that the autocovariance and hence the autocorrelation
function converges to zero exponentially fast.

A  Numerical  Example 

We consider the same example as before. The second part of the above equation
system delivers a difference equation for $\gamma(h)$: $\gamma(h) = \phi_1\gamma(h-1) + \phi_2\gamma(h-2) =
1.3\gamma(h-1) - 0.4\gamma(h-2)$, $h \ge 2$. The general solution of this difference equation
is (see Appendix B):

$$\gamma(h) = c_1 (0.8)^h + c_2 (0.5)^h, \quad h \ge 0,$$

where $0.8$ and $0.5$ are the inverses of the roots computed from the same polynomial
$\Phi(z) = 1 - 1.3z + 0.4z^2 = 0$.

⁷ In case of multiple roots the formula has to be adapted accordingly. See Eq. (B.2) in the Appendix.

The first part of the system delivers the initial conditions which determine the
constants $c_1$ and $c_2$:

$$\begin{aligned}
\gamma(0) - 1.3\gamma(-1) + 0.4\gamma(-2) &= \sigma^2 (1 + 0.4 \times 1.7), \\
\gamma(1) - 1.3\gamma(0) + 0.4\gamma(-1) &= \sigma^2\, 0.4,
\end{aligned}$$

where the numbers on the right hand side are taken from the first procedure.
Inserting the general solution in this equation system and bearing in mind that
$\gamma(h) = \gamma(-h)$ leads to:

$$\begin{aligned}
0.216 c_1 + 0.450 c_2 &= 1.68\sigma^2, \\
-0.180 c_1 - 0.600 c_2 &= 0.40\sigma^2.
\end{aligned}$$

Solving this equation system in the unknowns $c_1$ and $c_2$ one finally gets $c_1 =
(220/9)\sigma^2$ and $c_2 = -8\sigma^2$.
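
As a check on the arithmetic, the constants can be substituted back into the two initial conditions and into the homogeneous difference equation. The following Python sketch does so with exact rational arithmetic, measuring all autocovariances in units of $\sigma^2$; the variable names are illustrative.

```python
from fractions import Fraction as F

C1, C2 = F(220, 9), F(-8)  # constants from the text, in units of sigma^2


def gamma(h):
    """gamma(h) / sigma^2 = C1 * 0.8^|h| + C2 * 0.5^|h|, using the symmetry in h."""
    h = abs(h)
    return C1 * F(4, 5) ** h + C2 * F(1, 2) ** h


# Left-hand sides of the two initial conditions (h = 0 and h = 1)
lhs0 = gamma(0) - F("1.3") * gamma(-1) + F("0.4") * gamma(-2)
lhs1 = gamma(1) - F("1.3") * gamma(0) + F("0.4") * gamma(-1)

# Residuals of the homogeneous difference equation for h = 2, ..., 9
residuals = [
    gamma(h) - F("1.3") * gamma(h - 1) + F("0.4") * gamma(h - 2)
    for h in range(2, 10)
]

print(lhs0, lhs1, residuals)
```

The two left-hand sides reproduce the right-hand sides $1.68$ and $0.4$ of the initial conditions, and all residuals vanish.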


2.4.3  Third  Procedure 

Whereas the first two procedures produce an analytical solution which relies on
the solution of a linear difference equation, the third procedure is more suited for
numerical computation using a computer. It rests on the same equation system as in
the second procedure. The first step determines the values $\gamma(0), \gamma(1), \ldots, \gamma(p)$ from
the first part of the equation system. The following $\gamma(h)$, $h > p$, are then computed
recursively using the second part of the equation system.

A  Numerical  Example 

Using again the same example as before, the first part of the equation system delivers
$\gamma(2)$, $\gamma(1)$ and $\gamma(0)$:

$$\begin{aligned}
\gamma(0) - 1.3\gamma(-1) + 0.4\gamma(-2) &= \sigma^2 (1 + 0.4 \times 1.7), \\
\gamma(1) - 1.3\gamma(0) + 0.4\gamma(-1) &= \sigma^2\, 0.4, \\
\gamma(2) - 1.3\gamma(1) + 0.4\gamma(0) &= 0.
\end{aligned}$$

Bearing in mind that $\gamma(h) = \gamma(-h)$, this system has three equations in three
unknowns $\gamma(0)$, $\gamma(1)$ and $\gamma(2)$. The solution is: $\gamma(0) = (148/9)\sigma^2$, $\gamma(1) =
(140/9)\sigma^2$, $\gamma(2) = (614/45)\sigma^2$. This corresponds, of course, to the same numerical
values as before. The subsequent values for $\gamma(h)$, $h > 2$, are then determined
recursively from the difference equation $\gamma(h) = 1.3\gamma(h-1) - 0.4\gamma(h-2)$.
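
The third procedure is easy to mechanize. The Python sketch below first solves the three linear equations of the example and then applies the recursion. The small Gauss-Jordan helper `solve_linear` is an illustrative stand-in for any linear solver, exact fractions are used to avoid rounding, and all values are in units of $\sigma^2$.

```python
from fractions import Fraction as F


def solve_linear(A, b):
    """Solve A x = b by Gauss-Jordan elimination (assumes A is square and nonsingular)."""
    n = len(A)
    M = [list(row) + [rhs] for row, rhs in zip(A, b)]
    for col in range(n):
        # Bring a row with a nonzero entry in this column into pivot position.
        pivot_row = next(r for r in range(col, n) if M[r][col] != 0)
        M[col], M[pivot_row] = M[pivot_row], M[col]
        pivot = M[col][col]
        M[col] = [x / pivot for x in M[col]]
        # Eliminate the column from all other rows.
        for r in range(n):
            if r != col:
                factor = M[r][col]
                M[r] = [x - factor * y for x, y in zip(M[r], M[col])]
    return [M[r][n] for r in range(n)]


# Unknowns (gamma(0), gamma(1), gamma(2)) after imposing gamma(h) = gamma(-h)
A = [
    [F(1), F("-1.3"), F("0.4")],   # gamma(0) - 1.3 gamma(1) + 0.4 gamma(2) = 1.68
    [F("-1.3"), F("1.4"), F(0)],   # gamma(1) - 1.3 gamma(0) + 0.4 gamma(1) = 0.4
    [F("0.4"), F("-1.3"), F(1)],   # gamma(2) - 1.3 gamma(1) + 0.4 gamma(0) = 0
]
b = [F("1.68"), F("0.4"), F(0)]

gamma = solve_linear(A, b)
# Second step: recursion gamma(h) = 1.3 gamma(h-1) - 0.4 gamma(h-2) for h > 2
for h in range(3, 11):
    gamma.append(F("1.3") * gamma[h - 1] - F("0.4") * gamma[h - 2])

print([str(g) for g in gamma[:4]])
```

For larger models one would typically hand the linear system to a numerical library; the structure of the two steps stays the same.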

2.5 Exercises

Exercise 2.5.1. Consider the AR(1) process $X_t = 0.8X_{t-1} + Z_t$ with $Z_t \sim
\mathrm{WN}(0, \sigma^2)$. Compute the variance of $(X_1 + X_2 + X_3 + X_4)/4$.

Exercise 2.5.2. Check whether the following stochastic difference equations possess
a stationary solution. If yes, is the solution causal and/or invertible with respect
to $Z_t \sim \mathrm{WN}(0, \sigma^2)$?

(i) $X_t = Z_t + 2Z_{t-1}$

(ii) $X_t = 1.3X_{t-1} + Z_t$

(iii) $X_t = 1.3X_{t-1} - 0.4X_{t-2} + Z_t$

(iv) $X_t = 1.3X_{t-1} - 0.4X_{t-2} + Z_t - 0.3Z_{t-1}$

(v) $X_t = 0.2X_{t-1} + 0.8X_{t-2} + Z_t$

(vi) $X_t = 0.2X_{t-1} + 0.8X_{t-2} + Z_t - 1.5Z_{t-1} + 0.5Z_{t-2}$

Exercise 2.5.3. Compute the causal representation with respect to $Z_t \sim \mathrm{WN}(0, \sigma^2)$
for the following ARMA processes:

(i) $X_t = 1.3X_{t-1} - 0.4X_{t-2} + Z_t$

(ii) $X_t = 1.3X_{t-1} - 0.4X_{t-2} + Z_t - 0.2Z_{t-1}$

(iii) $X_t = \phi X_{t-1} + Z_t + \theta Z_{t-1}$ with $|\phi| < 1$

Exercise 2.5.4. Compute the autocovariance function of the ARMA processes:

(i) $X_t = 0.5X_{t-1} + 0.36X_{t-2} + Z_t$

(ii) $X_t = 0.5X_{t-1} + 0.36X_{t-2} + Z_t + 0.5Z_{t-1}$

Thereby $Z_t \sim \mathrm{WN}(0, \sigma^2)$.

Exercise 2.5.5. Verify that the process $\{\tilde{Z}_t\}$ defined in Eq. (2.4) is white noise with
$\tilde{Z}_t \sim \mathrm{WN}(0, \phi^{-2}\sigma^2)$.


3 Forecasting Stationary Processes

An important goal of time series analysis is forecasting. In the following we will
consider the problem of forecasting $X_{T+h}$, $h > 0$, given $\{X_T, \ldots, X_1\}$ where $\{X_t\}$
is a stationary stochastic process with known mean $\mu$ and known autocovariance
function $\gamma(h)$. In practical applications $\mu$ and $\gamma$ are unknown so that we must replace
these entities by their estimates. These estimates can be obtained directly from the
data as explained in Sect. 4.2 or indirectly by first estimating an appropriate ARMA
model (see Chap. 5) and then inferring the corresponding autocovariance function
using one of the methods explained in Sect. 2.4. Thus the forecasting problem is
inherently linked to the problem of identifying an appropriate ARMA model from
the data (see Deistler and Neusser 2012).


3.1  The  Theory  of  Linear  Least-Squares  Forecasts 

We restrict our discussion to linear forecast functions, also called linear predictors,
$P_T X_{T+h}$. Given observations from period 1 up to period $T$, these predictors take the
form:

$$P_T X_{T+h} = a_0 + a_1 X_T + \ldots + a_T X_1 = a_0 + \sum_{i=1}^{T} a_i X_{T+1-i}$$

with unknown coefficients $a_0, a_1, a_2, \ldots, a_T$. In principle, we should index these
coefficients by $T$ because they may change with every new observation. See the
example of the MA(1) process in Sect. 3.1.2. In order not to overload the notation,
we will omit this additional index.

In the Hilbert space of random variables with finite second moments the optimal
forecast in the mean squared error sense is given by the conditional expectation
$E(X_{T+h} \mid c, X_T, X_{T-1}, \ldots, X_1)$. However, having practical applications in mind, we
restrict ourselves to linear predictors for the following reasons¹:

(i) The determination of the conditional expectation is usually very difficult
because all possible functions must in principle be considered whereas linear
predictors are easy to compute.

(ii) The coefficients of the optimal (in the sense of mean squared errors) linear
forecasting function depend only on the first two moments of the time series,
i.e. on $E X_t$ and $\gamma(j)$, $j = 0, 1, \ldots, h + T - 1$.

(iii) In the case of Gaussian processes the conditional expectation coincides with
the linear predictor.

(iv) The optimal predictor is linear when the process is a causal and invertible
ARMA process even when $Z_t$ follows an arbitrary distribution with finite
variance (see Rosenblatt 2000, chapter 5).

(v) Practical experience has shown that even non-linear processes can be predicted
accurately by linear predictors.

The coefficients $a_0, \ldots, a_T$ of the forecasting function are determined such that
the mean squared errors are minimized. The use of mean squared errors as a criterion
leads to a compact representation of the solution to the forecasting problem. It
implies that over- and underestimation are treated equally. Thus, we have to solve
the following minimization problem:

$$\begin{aligned}
S = S(a_0, \ldots, a_T) &= E\left(X_{T+h} - P_T X_{T+h}\right)^2 \\
&= E\left(X_{T+h} - a_0 - a_1 X_T - \ldots - a_T X_1\right)^2 \longrightarrow \min_{a_0, a_1, \ldots, a_T}
\end{aligned}$$


As $S$ is a quadratic function, the coefficients $a_j$, $j = 0, 1, \ldots, T$, are uniquely
determined by the so-called normal equations. These are obtained from the first
order conditions of the minimization problem, i.e. from $\frac{\partial S}{\partial a_j} = 0$, $j = 0, 1, \ldots, T$:

$$\frac{\partial S}{\partial a_0} = E\left(X_{T+h} - a_0 - \sum_{i=1}^{T} a_i X_{T+1-i}\right) = 0, \qquad (3.1)$$

$$\frac{\partial S}{\partial a_j} = E\left[\left(X_{T+h} - a_0 - \sum_{i=1}^{T} a_i X_{T+1-i}\right) X_{T+1-j}\right] = 0, \quad j = 1, \ldots, T. \qquad (3.2)$$

The first equation can be rewritten as $a_0 = \mu - \sum_{i=1}^{T} a_i \mu$ so that the forecasting
function becomes:

$$P_T X_{T+h} = \mu + \sum_{i=1}^{T} a_i \left(X_{T+1-i} - \mu\right).$$

¹ Elliott and Timmermann (2008) provide a general overview of forecasting procedures and their
evaluations.
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The  unconditional  mean  of  the  forecast  error,  E  (Xt+i,  —  PtXt+i,),  is  therefore  equal 
to  zero.  This  means  that  there  is  no  bias,  neither  upward  nor  downward,  in  the 
forecasts.  The  forecasts  correspond  on  average  to  the  “true”  value. 

Inserting the expression for $\mathbb{P}_T X_{T+h}$ from above into the second normal equation,
we get:

$$\mathbb{E}\left[\left(X_{T+h} - \mathbb{P}_T X_{T+h}\right) X_{T+1-j}\right] = 0, \quad j = 1, 2, \ldots, T.$$

The forecast error is therefore uncorrelated with the available information
represented by past observations. Thus, the forecast errors $X_{T+h} - \mathbb{P}_T X_{T+h}$ are orthogonal
to $X_T, X_{T-1}, \ldots, X_1$. Geometrically speaking, the best linear forecast is obtained by
finding the point in the linear subspace spanned by $\{X_T, X_{T-1}, \ldots, X_1\}$ which is
closest to $X_{T+h}$. This point is found by projecting $X_{T+h}$ on this linear subspace.2

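The normal equations (3.1) and (3.2) can be rewritten in matrix notation as
follows:

$$a_0 = \mu \left(1 - \sum_{i=1}^{T} a_i\right) \tag{3.3}$$

$$\begin{pmatrix} \gamma(0) & \gamma(1) & \ldots & \gamma(T-1) \\ \gamma(1) & \gamma(0) & \ldots & \gamma(T-2) \\ \vdots & \vdots & \ddots & \vdots \\ \gamma(T-1) & \gamma(T-2) & \ldots & \gamma(0) \end{pmatrix} \begin{pmatrix} a_1 \\ a_2 \\ \vdots \\ a_T \end{pmatrix} = \begin{pmatrix} \gamma(h) \\ \gamma(h+1) \\ \vdots \\ \gamma(h+T-1) \end{pmatrix} \tag{3.4}$$

Denoting by $\iota$, $\alpha_T$ and $\gamma_T(h)$ the vectors $(1, 1, \ldots, 1)'$, $(a_1, \ldots, a_T)'$ and
$(\gamma(h), \ldots, \gamma(h+T-1))'$, and by $\Gamma_T = [\gamma(i-j)]_{i,j=1,\ldots,T}$ the symmetric
$T \times T$ covariance matrix of $(X_1, \ldots, X_T)'$, the normal equations can be written
compactly as:

$$a_0 = \mu \left(1 - \iota' \alpha_T\right) \tag{3.5}$$

$$\Gamma_T \alpha_T = \gamma_T(h). \tag{3.6}$$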

Dividing the second equation by $\gamma(0)$, one obtains an equation in terms of
autocorrelations instead of autocovariances:

$$R_T \alpha_T = \rho_T(h), \tag{3.7}$$

where $R_T = \Gamma_T / \gamma(0)$ and $\rho_T(h) = (\rho(h), \ldots, \rho(h+T-1))'$. The coefficients of
the forecasting function $\alpha_T$ are then obtained by inverting $\Gamma_T$, respectively $R_T$:

$$\alpha_T = \begin{pmatrix} a_1 \\ \vdots \\ a_T \end{pmatrix} = \Gamma_T^{-1} \gamma_T(h) = R_T^{-1} \rho_T(h).$$

2 Note the similarity of the forecast errors with the least-squares residuals of a linear regression.
A sufficient condition which ensures the invertibility of $R_T$, respectively $\Gamma_T$, is given
by assuming $\gamma(0) > 0$ and $\lim_{h \to \infty} \gamma(h) = 0$.3 The last condition is automatically
satisfied for ARMA processes because $\gamma(h)$ converges even exponentially fast to
zero (see Sect. 2.4).

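The mean squared error or variance of the forecast error for the forecasting
horizon $h$, $v_T(h)$, is given by:

$$\begin{aligned} v_T(h) &= \mathbb{E}\left(X_{T+h} - \mathbb{P}_T X_{T+h}\right)^2 \\ &= \gamma(0) - 2 \sum_{i=1}^{T} a_i \gamma(h+i-1) + \sum_{i=1}^{T} \sum_{j=1}^{T} a_i \gamma(i-j) a_j \\ &= \gamma(0) - 2 \alpha_T' \gamma_T(h) + \alpha_T' \Gamma_T \alpha_T \\ &= \gamma(0) - \alpha_T' \gamma_T(h), \end{aligned}$$

because $\Gamma_T \alpha_T = \gamma_T(h)$. Factoring out $\gamma(0)$, one can write the mean squared
forecast error as:

$$v_T(h) = \gamma(0) \left(1 - \alpha_T' \rho_T(h)\right). \tag{3.8}$$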

Because  the  coefficients  of  the  forecast  function  have  to  be  recomputed  with 
the  arrival  of  every  new  observation,  it  is  necessary  to  have  a  fast  and  reliable 
algorithm  at  hand.  These  numerical  problems  have  been  solved  by  the  development 
of  appropriate  computer  algorithms,  like  the  Durbin-Levinson  algorithm  or  the 
innovation  algorithm  (see  Brockwell  and  Davis  1991,  Chapter  5). 
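
As an illustration of Eqs. (3.6) and (3.8), the following minimal sketch solves the normal equations directly with a general linear solver. It is not the Durbin-Levinson or the innovation algorithm mentioned above; the function name `linear_predictor` and the callable `gamma` are hypothetical names introduced here, and the process is assumed to have mean zero with a known autocovariance function.

```python
import numpy as np

def linear_predictor(gamma, T, h=1):
    """Solve Gamma_T alpha_T = gamma_T(h) (Eq. 3.6) and return (alpha_T, v_T(h)).

    gamma: callable returning the autocovariance gamma(j) for lag j >= 0
           (a hypothetical helper, assumed to be known exactly).
    """
    idx = np.arange(T)
    # Gamma_T = [gamma(i - j)], the T x T covariance matrix of (X_1, ..., X_T)'
    Gamma = np.array([[gamma(abs(i - j)) for j in idx] for i in idx], dtype=float)
    # gamma_T(h) = (gamma(h), ..., gamma(h + T - 1))'
    g = np.array([gamma(h + i) for i in idx], dtype=float)
    alpha = np.linalg.solve(Gamma, g)
    # Mean squared forecast error: v_T(h) = gamma(0) - alpha_T' gamma_T(h)
    v = gamma(0) - alpha @ g
    return alpha, v

# Illustrative AR(1) with phi = 0.5 and sigma^2 = 1: gamma(j) = phi^j / (1 - phi^2)
phi, sigma2 = 0.5, 1.0

def gamma_ar1(j):
    return sigma2 * phi**j / (1.0 - phi**2)

alpha, v = linear_predictor(gamma_ar1, T=4, h=2)
print(alpha, v)
```

For this AR(1) case only the first coefficient is non-zero, in line with the result derived in Sect. 3.1.1.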


3.1.1 Forecasting with an AR(p) Process

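Consider first the case of an AR(1) process:

$$X_t = \phi X_{t-1} + Z_t \quad \text{with } |\phi| < 1 \text{ and } Z_t \sim WN(0, \sigma^2).$$

The equation system (3.7) becomes:

$$\begin{pmatrix} 1 & \phi & \phi^2 & \ldots & \phi^{T-1} \\ \phi & 1 & \phi & \ldots & \phi^{T-2} \\ \phi^2 & \phi & 1 & \ldots & \phi^{T-3} \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ \phi^{T-1} & \phi^{T-2} & \phi^{T-3} & \ldots & 1 \end{pmatrix} \begin{pmatrix} a_1 \\ a_2 \\ a_3 \\ \vdots \\ a_T \end{pmatrix} = \begin{pmatrix} \phi^h \\ \phi^{h+1} \\ \phi^{h+2} \\ \vdots \\ \phi^{h+T-1} \end{pmatrix}$$

3 See Brockwell and Davis (1991, p. 167).

The guess-and-verify method immediately leads to the solution:

$$\alpha_T = (a_1, a_2, a_3, \ldots, a_T)' = \left(\phi^h, 0, 0, \ldots, 0\right)'.$$

We therefore get the following predictor:

$$\mathbb{P}_T X_{T+h} = \phi^h X_T.$$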


The forecast therefore depends only on the last observation, with the corresponding
coefficient $a_1 = \phi^h$ being independent of $T$. All previous observations can be
disregarded because they cannot improve the forecast further. To put it otherwise, all the
useful information about $X_{T+h}$ in the entire realization up to $X_T$, i.e. in
$\{X_T, X_{T-1}, \ldots, X_1\}$, is contained in $X_T$.

The variance of the prediction error is given by

$$v_T(h) = \frac{1 - \phi^{2h}}{1 - \phi^2}\, \sigma^2.$$

For $h = 1$, the formula simplifies to $\sigma^2$ and for $h \to \infty$, $v_T(h) \to \frac{1}{1 - \phi^2} \sigma^2$, the
unconditional variance of $X_t$. Note also that the variance of the forecast error $v_T(h)$
increases with $h$.
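
The variance formula can be checked by a short derivation. The following sketch iterates the AR(1) equation $h$ times and uses only the white noise property of $\{Z_t\}$ and the fact that $Z_{T+1}, \ldots, Z_{T+h}$ are uncorrelated with $X_T$ under causality:

```latex
\begin{align*}
X_{T+h} &= \phi^h X_T + \sum_{j=0}^{h-1} \phi^j Z_{T+h-j}, \\
X_{T+h} - \mathbb{P}_T X_{T+h} &= \sum_{j=0}^{h-1} \phi^j Z_{T+h-j}, \\
v_T(h) &= \sigma^2 \sum_{j=0}^{h-1} \phi^{2j}
        = \frac{1 - \phi^{2h}}{1 - \phi^2}\, \sigma^2 .
\end{align*}
```

Each additional step ahead adds the non-negative term $\phi^{2(h-1)} \sigma^2$, which is why $v_T(h)$ is increasing in $h$.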

The general case of an AR(p) process, $p > 1$, can be treated in the same way.
The autocovariances follow a $p$-th order difference equation (see Sect. 2.4):

$$\gamma(j) = \phi_1 \gamma(j-1) + \phi_2 \gamma(j-2) + \ldots + \phi_p \gamma(j-p).$$

Applying again the guess-and-verify method for the case $h = 1$ and assuming that
$T > p$, the solution is given by $\alpha_T = (\phi_1, \phi_2, \ldots, \phi_p, 0, \ldots, 0)'$. Thus the one-step
ahead predictor is

$$\mathbb{P}_T X_{T+1} = \phi_1 X_T + \phi_2 X_{T-1} + \ldots + \phi_p X_{T+1-p}, \quad T > p. \tag{3.9}$$

The one-step ahead forecast of an AR(p) process therefore depends only on the last
$p$ observations.

The above predictor can also be obtained in a different way. View for this purpose
$\mathbb{P}_T$ as an operator with the following meaning: take the linear least-squares forecast
with respect to the information $\{X_T, \ldots, X_1\}$. Apply this operator to the defining
stochastic difference equation of the AR(p) process forwarded one period:

$$\mathbb{P}_T X_{T+1} = \mathbb{P}_T\left(\phi_1 X_T\right) + \mathbb{P}_T\left(\phi_2 X_{T-1}\right) + \ldots + \mathbb{P}_T\left(\phi_p X_{T+1-p}\right) + \mathbb{P}_T\left(Z_{T+1}\right).$$

In period $T$ the observations $X_T, \ldots, X_1$ are known so that $\mathbb{P}_T X_{T-j} = X_{T-j}$,
$j = 0, 1, \ldots, T-1$. Because $\{Z_t\}$ is a white noise process and because $\{X_t\}$ is
a causal function with respect to $\{Z_t\}$, $Z_{T+1}$ is uncorrelated with $X_T, \ldots, X_1$, so that
$\mathbb{P}_T Z_{T+1} = 0$. This reasoning leads to the same predictor as in Eq. (3.9).

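The forecasting functions for $h > 1$ can be obtained recursively by successively
applying the forecast operator. Take, for example, the case $h = 2$:

$$\begin{aligned} \mathbb{P}_T X_{T+2} &= \mathbb{P}_T\left(\phi_1 X_{T+1}\right) + \mathbb{P}_T\left(\phi_2 X_T\right) + \ldots + \mathbb{P}_T\left(\phi_p X_{T+2-p}\right) + \mathbb{P}_T\left(Z_{T+2}\right) \\ &= \phi_1 \left(\phi_1 X_T + \phi_2 X_{T-1} + \ldots + \phi_p X_{T+1-p}\right) + \phi_2 X_T + \ldots + \phi_p X_{T+2-p} \\ &= \left(\phi_1^2 + \phi_2\right) X_T + \left(\phi_1 \phi_2 + \phi_3\right) X_{T-1} + \ldots + \left(\phi_1 \phi_{p-1} + \phi_p\right) X_{T+2-p} + \phi_1 \phi_p X_{T+1-p}. \end{aligned}$$

In this way forecasting functions for $h > 2$ can be obtained recursively.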
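
The recursion can be sketched in a few lines of code. The sketch below assumes a mean-zero AR(p) process with known coefficients and at least $p$ observations; the function name `ar_forecast` and the numerical values are illustrative assumptions, not taken from the text.

```python
def ar_forecast(phi, x, h):
    """Return [P_T X_{T+1}, ..., P_T X_{T+h}] for a mean-zero AR(p) process.

    phi: list of known coefficients [phi_1, ..., phi_p] (assumed known here).
    x:   observations X_1, ..., X_T in time order, with len(x) >= p.
    """
    p = len(phi)
    hist = list(x[-p:])          # only the last p observations matter
    forecasts = []
    for _ in range(h):
        # Apply the forecast operator to the AR equation: future Z's drop out,
        # future X's are replaced by their own forecasts.
        nxt = sum(phi[k] * hist[-1 - k] for k in range(p))
        forecasts.append(nxt)
        hist.append(nxt)
    return forecasts

# Illustrative AR(2): phi_1 = 0.5, phi_2 = 0.3, with X_{T-1} = 1 and X_T = 2
phi = [0.5, 0.3]
f = ar_forecast(phi, [1.0, 2.0], h=2)
# The two-step forecast agrees with (phi_1^2 + phi_2) X_T + phi_1 phi_2 X_{T-1}
closed_form = (phi[0] ** 2 + phi[1]) * 2.0 + phi[0] * phi[1] * 1.0
print(f, closed_form)
```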

Note that in the case of AR(p) processes the coefficients of the forecast function
remain constant as long as $T > p$. Thus, it is not necessary to set up and solve the
equation system again with each new observation. This will be different in the case
of MA processes. In practice, the parameters of the AR model are usually unknown
and therefore have to be replaced by some estimate. Section 14.2 investigates in a more
general context how this substitution affects the results.


3.1.2 Forecasting with MA(q) Processes

The  forecasting  problem  becomes  more  complicated  in  the  case  of  MA(q)  processes. 
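In order to get a better understanding we analyze the case of an MA(1) process:

$$X_t = Z_t + \theta Z_{t-1} \quad \text{with } |\theta| < 1 \text{ and } Z_t \sim WN(0, \sigma^2).$$

Taking a forecast horizon of one period, i.e. $h = 1$, the equation system (3.7) in the
case of an MA(1) process becomes:

$$\begin{pmatrix} 1 & \frac{\theta}{1+\theta^2} & 0 & \ldots & 0 \\ \frac{\theta}{1+\theta^2} & 1 & \frac{\theta}{1+\theta^2} & \ldots & 0 \\ 0 & \frac{\theta}{1+\theta^2} & 1 & \ldots & 0 \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & 0 & \ldots & 1 \end{pmatrix} \begin{pmatrix} a_1 \\ a_2 \\ a_3 \\ \vdots \\ a_T \end{pmatrix} = \begin{pmatrix} \frac{\theta}{1+\theta^2} \\ 0 \\ 0 \\ \vdots \\ 0 \end{pmatrix} \tag{3.10}$$

Despite the fact that the equation system has a simple structure, the forecasting
function will depend in general on all past observations $X_{T-j}$, $0 \le j < T$. We
illustrate this point by a numerical example which will allow us to get a deeper
understanding.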

Suppose that we know the parameters of the MA(1) process to be $\theta = -0.9$
and $\sigma^2 = 1$. We start the forecasting exercise in period $T = 0$ and assume that,
at this point in time, we have no observation at hand. The best forecast is therefore
just the unconditional mean which in this example is zero. Thus, $\mathbb{P}_0 X_1 = 0$. The
variance of the forecast error then is $\mathbb{V}(X_1 - \mathbb{P}_0 X_1) = v_0(1) = \sigma^2 + \theta^2 \sigma^2 = 1.81$.
This result is summarized in the first row of Table 3.1. In period 1, the realization of
$X_1$ is observed. This information can be used and the forecasting function becomes
$\mathbb{P}_1 X_2 = a_1 X_1$. The coefficient $a_1$ is found by solving the equation system (3.10)
for $T = 1$. This gives $a_1 = \theta / (1 + \theta^2) = -0.4972$. The corresponding
variance of the forecast error according to Eq. (3.8) is $\mathbb{V}(X_2 - \mathbb{P}_1 X_2) = v_1(1) =
\gamma(0)(1 - \alpha_1' \rho_1(1)) = 1.81\,(1 - 0.4972 \times 0.4972) = 1.3625$. This value is lower
compared to the previous forecast because additional information, the observation
of the realization of $X_1$, is taken into account. Row 2 in Table 3.1 summarizes these
results.

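In period 2, not only $X_1$, but also $X_2$ is observed which allows us to base our
forecast on both observations: $\mathbb{P}_2 X_3 = a_1 X_2 + a_2 X_1$. The coefficients can be found
by solving the equation system (3.10) for $T = 2$. This amounts to solving the
simultaneous equation system

$$\begin{pmatrix} 1 & \frac{\theta}{1+\theta^2} \\ \frac{\theta}{1+\theta^2} & 1 \end{pmatrix} \begin{pmatrix} a_1 \\ a_2 \end{pmatrix} = \begin{pmatrix} \frac{\theta}{1+\theta^2} \\ 0 \end{pmatrix}.$$

Inserting $\theta = -0.9$, the solution is $\alpha_2 = (a_1, a_2)' = (-0.6606, -0.3285)'$. The
corresponding variance of the forecast error becomes

$$\begin{aligned} \mathbb{V}(X_3 - \mathbb{P}_2 X_3) = v_2(1) &= \gamma(0)\left(1 - \alpha_2' \rho_2(1)\right) \\ &= \gamma(0)\left(1 - \begin{pmatrix} a_1 & a_2 \end{pmatrix} \begin{pmatrix} \frac{\theta}{1+\theta^2} \\ 0 \end{pmatrix}\right) \\ &= 1.81 \left(1 - \begin{pmatrix} -0.6606 & -0.3285 \end{pmatrix} \begin{pmatrix} -0.4972 \\ 0 \end{pmatrix}\right) = 1.2155. \end{aligned}$$

These results are summarized in row 3 of Table 3.1.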

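In period 3, the realizations of $X_1$, $X_2$ and $X_3$ are known so that the forecast
function becomes $\mathbb{P}_3 X_4 = a_1 X_3 + a_2 X_2 + a_3 X_1$. The coefficients can again be found
by solving the equation system (3.10) for $T = 3$:

$$\begin{pmatrix} 1 & \frac{\theta}{1+\theta^2} & 0 \\ \frac{\theta}{1+\theta^2} & 1 & \frac{\theta}{1+\theta^2} \\ 0 & \frac{\theta}{1+\theta^2} & 1 \end{pmatrix} \begin{pmatrix} a_1 \\ a_2 \\ a_3 \end{pmatrix} = \begin{pmatrix} \frac{\theta}{1+\theta^2} \\ 0 \\ 0 \end{pmatrix}.$$

Table 3.1 Forecast function for a MA(1) process with $\theta = -0.9$ and $\sigma^2 = 1$

Time           Forecasting function $\alpha_T = (a_1, a_2, \ldots, a_T)'$              Forecast error variance
$T = 0$                                                                          $v_0(1) = 1.8100$
$T = 1$        $\alpha_1 = (-0.4972)'$                                             $v_1(1) = 1.3625$
$T = 2$        $\alpha_2 = (-0.6606, -0.3285)'$                                    $v_2(1) = 1.2155$
$T = 3$        $\alpha_3 = (-0.7404, -0.4891, -0.2432)'$                           $v_3(1) = 1.1436$
$T = 4$        $\alpha_4 = (-0.7870, -0.5827, -0.3849, -0.1914)'$                  $v_4(1) = 1.1017$
...
$T = \infty$   $\alpha_\infty = (-0.9000, -0.8100, -0.7290, \ldots)'$              $v_\infty(1) = 1$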

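For $\theta = -0.9$, the coefficients of the linear predictor are $\alpha_3 = (a_1, a_2, a_3)' =
(-0.7404, -0.4891, -0.2432)'$. The corresponding variance of the forecast error
becomes

$$\begin{aligned} \mathbb{V}(X_4 - \mathbb{P}_3 X_4) = v_3(1) &= \gamma(0)\left(1 - \alpha_3' \rho_3(1)\right) \\ &= \gamma(0)\left(1 - \begin{pmatrix} a_1 & a_2 & a_3 \end{pmatrix} \begin{pmatrix} \frac{\theta}{1+\theta^2} \\ 0 \\ 0 \end{pmatrix}\right) \\ &= 1.81 \left(1 - \begin{pmatrix} -0.7404 & -0.4891 & -0.2432 \end{pmatrix} \begin{pmatrix} -0.4972 \\ 0 \\ 0 \end{pmatrix}\right) = 1.1436. \end{aligned}$$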

These results are summarized in row 4 of Table 3.1. We can, of course, continue in
this way and derive successively the forecast functions for $T = 4, 5, \ldots$

From  this  exercise  we  can  make  several  observations. 


• In contrast to the AR process, every new piece of information is used. The forecast
$\mathbb{P}_T X_{T+1}$ depends on all available information, in particular on $X_T, X_{T-1}, \ldots, X_1$.

• The coefficients of the forecast function are not constant. They change as more
and more information comes in.

• The importance of the new information can be "measured" by the last coefficient
of $\alpha_T$. These coefficients are termed partial autocorrelations (see Definition 3.2)
and are of particular relevance as will be explained in Sect. 3.5. In our example
they are -0.4972, -0.3285, -0.2432, and -0.1914.

• As more information becomes available, the variance of the forecast error
(mean squared error) declines monotonically. It will converge to $\sigma^2 = 1$. The
reason for this result can be explained as follows. Applying the forecasting
operator to the defining MA(1) stochastic difference equation forwarded by
one period gives: $\mathbb{P}_T X_{T+1} = \mathbb{P}_T Z_{T+1} + \theta \mathbb{P}_T Z_T = \theta \mathbb{P}_T Z_T$ with forecast error
$X_{T+1} - \mathbb{P}_T X_{T+1} = Z_{T+1} + \theta (Z_T - \mathbb{P}_T Z_T)$. As more and more observations become
available, it becomes possible to recover the "true" value of the unobserved
$Z_T$ from the observations $X_T, X_{T-1}, \ldots, X_1$ with increasing precision. As the process
is invertible, in the limit it is possible to recover the value of $Z_T$ exactly (almost surely),
so that the forecast error reduces to $Z_{T+1}$. The only
uncertainty remaining is with respect to $Z_{T+1}$ which has a mean of zero and a
variance of $\sigma^2 = 1$.
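
The entries of Table 3.1 can be reproduced numerically. The following sketch solves the equation system (3.10) for $T = 1, \ldots, 4$ with a general-purpose linear solver and evaluates Eq. (3.8); the function name `ma1_predictor` is a hypothetical helper introduced only for this illustration.

```python
import numpy as np

theta, sigma2 = -0.9, 1.0
gamma0 = sigma2 * (1 + theta**2)   # gamma(0) = 1.81
rho1 = theta / (1 + theta**2)      # rho(1); all higher autocorrelations are zero

def ma1_predictor(T):
    """Solve system (3.10) for h = 1 and return (alpha_T, v_T(1))."""
    # R_T is tridiagonal: ones on the diagonal, rho(1) on the off-diagonals
    R = np.eye(T) + rho1 * (np.eye(T, k=1) + np.eye(T, k=-1))
    rhs = np.zeros(T)
    rhs[0] = rho1                  # rho_T(1) = (rho(1), 0, ..., 0)'
    alpha = np.linalg.solve(R, rhs)
    v = gamma0 * (1 - alpha @ rhs) # Eq. (3.8)
    return alpha, v

results = {T: ma1_predictor(T) for T in range(1, 5)}
for T, (alpha, v) in results.items():
    print(T, np.round(alpha, 4), round(float(v), 4))
```

The last element of each coefficient vector is the partial autocorrelation mentioned in the third observation above.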


3.1.3 Forecasting from the Infinite Past

The forecasting function based on the infinitely remote past is of particular
theoretical interest. Thereby we look at the problem of finding the optimal linear
forecast of $X_{T+h}$ given $X_T, X_{T-1}, \ldots, X_1, X_0, X_{-1}, \ldots$, taking the mean squared error
again as the criterion function. The corresponding forecasting function (predictor)
will be denoted by $\widetilde{\mathbb{P}}_T X_{T+h}$, $h > 0$.

Noting that the MA(1) process with $|\theta| < 1$ is invertible, we have

$$Z_t = X_t - \theta X_{t-1} + \theta^2 X_{t-2} - \ldots$$

We can therefore write $X_{T+1}$ as

$$X_{T+1} = Z_{T+1} + \theta \underbrace{\left(X_T - \theta X_{T-1} + \theta^2 X_{T-2} - \ldots\right)}_{Z_T}.$$

The predictor of $X_{T+1}$ from the infinite past, $\widetilde{\mathbb{P}}_T$, is then given by:

$$\widetilde{\mathbb{P}}_T X_{T+1} = \theta \left(X_T - \theta X_{T-1} + \theta^2 X_{T-2} - \ldots\right)$$

where the mean squared forecasting error is

$$v_\infty(1) = \mathbb{E}\left(X_{T+1} - \widetilde{\mathbb{P}}_T X_{T+1}\right)^2 = \sigma^2.$$

Applying this result to our example gives:

$$\widetilde{\mathbb{P}}_T X_{T+1} = -0.9 X_T - 0.81 X_{T-1} - 0.729 X_{T-2} - \ldots$$

with $v_\infty(1) = 1$. See the last row in Table 3.1.

Example of an ARMA(1,1) Process

Consider now the case of a causal and invertible ARMA(1,1) process $\{X_t\}$:

$$X_t = \phi X_{t-1} + Z_t + \theta Z_{t-1},$$

where $|\phi| < 1$, $|\theta| < 1$ and $Z_t \sim WN(0, \sigma^2)$. Because $\{X_t\}$ is causal and invertible
with respect to $\{Z_t\}$,

$$X_{T+1} = Z_{T+1} + (\phi + \theta) \sum_{j=0}^{\infty} \phi^j Z_{T-j},$$

$$Z_{T+1} = X_{T+1} - (\phi + \theta) \sum_{j=0}^{\infty} (-\theta)^j X_{T-j}.$$

Applying the forecast operator $\widetilde{\mathbb{P}}_T$ to the second equation and noting that
$\widetilde{\mathbb{P}}_T Z_{T+1} = 0$, one obtains the following one-step ahead predictor

$$\widetilde{\mathbb{P}}_T X_{T+1} = (\phi + \theta) \sum_{j=0}^{\infty} (-\theta)^j X_{T-j}.$$

Applying the forecast operator to the first equation, we obtain

$$\widetilde{\mathbb{P}}_T X_{T+1} = (\phi + \theta) \sum_{j=0}^{\infty} \phi^j Z_{T-j}.$$

This implies that the one-step ahead prediction error is equal to $X_{T+1} - \widetilde{\mathbb{P}}_T X_{T+1} =
Z_{T+1}$ and that the mean squared forecasting error of the one-step ahead predictor
given the infinite past is equal to $\mathbb{E} Z_{T+1}^2 = \sigma^2$.
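
In practice the infinite sum has to be truncated. The sketch below lists the first few weights $(\phi + \theta)(-\theta)^j$ that the one-step ahead predictor from the infinite past places on $X_T, X_{T-1}, \ldots$; the function name `arma11_predictor_weights` is an assumption made for this illustration. Setting $\phi = 0$ gives the MA(1) case of the previous subsection.

```python
def arma11_predictor_weights(phi, theta, n):
    """First n weights on X_T, X_{T-1}, ... in the one-step ahead predictor
    from the infinite past of a causal and invertible ARMA(1,1) process."""
    return [(phi + theta) * (-theta) ** j for j in range(n)]

# MA(1) special case (phi = 0) with theta = -0.9, as in the numerical example
w = arma11_predictor_weights(0.0, -0.9, 3)
print(w)   # weights on X_T, X_{T-1}, X_{T-2}
```

Because $|\theta| < 1$, the weights decay geometrically, so truncating the sum after a moderate number of terms changes the forecast only slightly.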


3.2  The  Wold  Decomposition  Theorem 

The  Wold  Decomposition  theorem  is  essential  for  the  theoretical  understanding  of 
stationary  stochastic  processes.  It  shows  that  any  stationary  process  can  essentially 
be  represented  as  a  linear  combination  of  current  and  past  forecast  errors.  Before  we 
can  state  the  theorem  precisely,  we  have  to  introduce  the  following  definition. 

Definition 3.1 (Deterministic Process). A stationary stochastic process $\{X_t\}$ is
called (purely) deterministic or (purely) singular if and only if it can be forecasted
exactly from the infinite past. More precisely, if and only if

$$\sigma^2 = \mathbb{E}\left(X_{t+1} - \widetilde{\mathbb{P}}_t X_{t+1}\right)^2 = 0 \quad \text{for all } t \in \mathbb{Z}$$

where $\widetilde{\mathbb{P}}_t X_{t+1}$ denotes the best linear forecast of $X_{t+1}$ given its infinite past, i.e. given
$\{X_t, X_{t-1}, \ldots\}$.

The most important class of deterministic processes are the harmonic processes.
These processes are characterized by finite or infinite sums of sine and cosine
functions with stochastic amplitude.4 A simple example of a harmonic process is
given by

$$X_t = A \cos(\omega t) + B \sin(\omega t) \quad \text{with } \omega \in (0, \pi).$$

4 More about harmonic processes can be found in Sect. 6.2.

Thereby, $A$ and $B$ denote two uncorrelated random variables with mean zero and
finite variance. One can check that $X_t$ satisfies the deterministic difference equation

$$X_t = (2 \cos \omega) X_{t-1} - X_{t-2}.$$

Thus, $X_t$ can be forecasted exactly from its past. In this example the last two
observations are sufficient. We are now in a position to state the Wold Decomposition
Theorem.
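
The difference equation of the harmonic process can be checked numerically for one draw of the amplitudes. The sketch below is only a sanity check, not a proof; the Gaussian draws for $A$ and $B$, the seed, and the frequency $\omega = 1$ are arbitrary assumptions made for the illustration.

```python
import math
import random

random.seed(0)
A = random.gauss(0.0, 1.0)   # one realization of the random amplitude A
B = random.gauss(0.0, 1.0)   # one realization of the random amplitude B
omega = 1.0                  # any frequency in (0, pi)

def x(t):
    """Harmonic process X_t = A cos(omega t) + B sin(omega t) for fixed A, B."""
    return A * math.cos(omega * t) + B * math.sin(omega * t)

# Largest deviation from X_t = (2 cos omega) X_{t-1} - X_{t-2} over t = 2, ..., 49
max_err = max(
    abs(x(t) - (2 * math.cos(omega) * x(t - 1) - x(t - 2)))
    for t in range(2, 50)
)
print(max_err)   # zero up to floating point rounding
```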

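Theorem 3.1 (Wold Decomposition). Every stationary stochastic process $\{X_t\}$ with
mean zero and finite positive variance can be represented as

$$X_t = \sum_{j=0}^{\infty} \psi_j Z_{t-j} + V_t, \tag{3.11}$$

where

(i) $Z_t = X_t - \widetilde{\mathbb{P}}_{t-1} X_t = \widetilde{\mathbb{P}}_t Z_t$;

(ii) $Z_t \sim WN(0, \sigma^2)$ with $\sigma^2 = \mathbb{E}\left(X_{t+1} - \widetilde{\mathbb{P}}_t X_{t+1}\right)^2 > 0$;

(iii) $\psi_0 = 1$ and $\sum_{j=0}^{\infty} \psi_j^2 < \infty$;

(iv) $\{V_t\}$ is deterministic;

(v) $\mathbb{E}(Z_t V_s) = 0$ for all $t, s \in \mathbb{Z}$.

The sequences $\{\psi_j\}$, $\{Z_t\}$, and $\{V_t\}$ are uniquely determined by (3.11).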

Proof.  The  proof,  although  insightful,  requires  some  knowledge  about  Hilbert 
spaces  which  is  beyond  the  scope  of  this  book.  A  rigorous  proof  can  be  found  in 
Brockwell  and  Davis  (1991,  Section  5.7). 

It is nevertheless instructive to give an intuition of the proof. Following the
MA(1) example from the previous section, we start in period 0 and assume that
no information is available. Thus, the best forecast $\mathbb{P}_0 X_1$ is zero so that trivially

$$X_1 = X_1 - \mathbb{P}_0 X_1 = Z_1.$$

Starting with $X_1 = Z_1$, the representations of $X_2, X_3, \ldots$ can then be constructed recursively:

$$\begin{aligned} X_2 &= X_2 - \mathbb{P}_1 X_2 + \mathbb{P}_1 X_2 = Z_2 + a_1^{(1)} X_1 = Z_2 + a_1^{(1)} Z_1 \\ X_3 &= X_3 - \mathbb{P}_2 X_3 + \mathbb{P}_2 X_3 = Z_3 + a_1^{(2)} X_2 + a_2^{(2)} X_1 \\ &= Z_3 + a_1^{(2)} Z_2 + \left(a_1^{(2)} a_1^{(1)} + a_2^{(2)}\right) Z_1 \\ X_4 &= X_4 - \mathbb{P}_3 X_4 + \mathbb{P}_3 X_4 = Z_4 + a_1^{(3)} X_3 + a_2^{(3)} X_2 + a_3^{(3)} X_1 \\ &= Z_4 + a_1^{(3)} Z_3 + \left(a_1^{(3)} a_1^{(2)} + a_2^{(3)}\right) Z_2 \\ &\quad + \left(a_1^{(3)} a_1^{(2)} a_1^{(1)} + a_1^{(3)} a_2^{(2)} + a_2^{(3)} a_1^{(1)} + a_3^{(3)}\right) Z_1 \\ &\;\;\vdots \end{aligned}$$

where $a_j^{(t-1)}$, $j = 1, \ldots, t-1$, denote the coefficients of the forecast function for $X_t$
based on $X_{t-1}, \ldots, X_1$. This shows how $X_t$ unfolds into the sum of forecast errors.
The stationarity of $\{X_t\}$ ensures that the coefficients of $Z_{t-j}$ converge, as $t$ goes to
infinity, to $\psi_j$ which are independent of $t$. □

Every stationary stochastic process is thus representable as the sum of a moving-
average of infinite order and a (purely) deterministic process.5 The weights of the
infinite moving average are thereby normalized such that $\psi_0 = 1$. In addition,
the coefficients $\psi_j$ are square summable. This property is less strong than absolute
summability which is required for a causal representation (see Definition 2.2).6 The
process $\{Z_t\}$ is a white noise process with positive variance $\sigma^2 > 0$. The $Z_t$'s are
called innovations as they represent the one-period ahead forecast errors based on
the infinite past, i.e. $Z_t = X_t - \widetilde{\mathbb{P}}_{t-1} X_t$. $Z_t$ is the additional information revealed
from the $t$-th observation. Thus, the Wold Decomposition Theorem serves as a
justification for the use of causal ARMA models. In this instance, the deterministic
component $\{V_t\}$ vanishes.

The second part of property (i) further means that the innovation process $\{Z_t\}$
is fundamental with respect to $\{X_t\}$, i.e. that $Z_t$ lies in the linear space spanned by
$\{X_t, X_{t-1}, X_{t-2}, \ldots\}$ or that $Z_t = \widetilde{\mathbb{P}}_t Z_t$. This implies that $\Psi(L)$ must be invertible
and that $Z_t$ can be perfectly (almost surely) recovered from the observations of
$X_t, X_{t-1}, \ldots$ Finally, property (v) says that the two components $\{Z_t\}$ and $\{V_t\}$ are
uncorrelated with each other at all leads and lags. Thus, in essence, the Wold
Decomposition Theorem states that every stationary stochastic process can be
uniquely decomposed into a weighted sum of current and past forecast errors plus a
deterministic process.

Although the Wold Decomposition is very appealing from a theoretical
perspective, it is not directly implementable in practice because it requires the estimation
of infinitely many parameters $(\psi_1, \psi_2, \ldots)$. This is impossible with only a finite
number of observations. It is therefore necessary to place some assumptions on
$(\psi_1, \psi_2, \ldots)$. One possibility is to assume that $\{X_t\}$ is a causal ARMA process and
to recover the $\psi_j$'s from the causal representation. This amounts to saying that $\Psi(L)$ is
a rational function of the lag operator, which means that

$$\Psi(L) = \frac{\Theta(L)}{\Phi(L)} = \frac{1 + \theta_1 L + \theta_2 L^2 + \ldots + \theta_q L^q}{1 - \phi_1 L - \phi_2 L^2 - \ldots - \phi_p L^p}.$$

5 The Wold Decomposition corresponds to the decomposition of the spectral distribution function
$F$ into the sum of $F_Z$ and $F_V$ (see Sect. 6.2). Thereby the spectral distribution function $F_Z$ has
spectral density $f_Z(\lambda) = \frac{\sigma^2}{2\pi} \left|\Psi(e^{-\imath\lambda})\right|^2$.

6 The series $\psi_j = 1/j$, for example, is square summable, but not absolutely summable.

Thus, the process is characterized by only a finite number, $p + q$, of parameters.
Another possibility is to place restrictions on the smoothness of the spectrum (see
Chap. 6).
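
To illustrate how the $\psi_j$'s follow from a finite number of ARMA parameters, the sketch below computes them by matching coefficients in $\Phi(L)\Psi(L) = \Theta(L)$, which gives $\psi_j = \theta_j + \sum_{k=1}^{\min(j,p)} \phi_k \psi_{j-k}$ with $\theta_j = 0$ for $j > q$. The function name `psi_weights` and the parameter values are assumptions made for this illustration.

```python
def psi_weights(phi, theta, n):
    """Return psi_0, ..., psi_{n-1} of Psi(L) = Theta(L) / Phi(L).

    phi:   [phi_1, ..., phi_p]     (AR coefficients, causal process assumed)
    theta: [theta_1, ..., theta_q] (MA coefficients)
    """
    p, q = len(phi), len(theta)
    psi = [1.0]                                   # psi_0 = 1
    for j in range(1, n):
        val = theta[j - 1] if j <= q else 0.0     # theta_j, zero beyond lag q
        val += sum(phi[k - 1] * psi[j - k] for k in range(1, min(j, p) + 1))
        psi.append(val)
    return psi

# Illustrative ARMA(1,1) with phi = 0.5, theta = 0.4:
# psi_j = (phi + theta) * phi^(j-1) for j >= 1
psi = psi_weights([0.5], [0.4], 4)
print(psi)
```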

The  Wold  Decomposition  Theorem  has  several  implications  which  are  presented 
in  the  following  remarks. 

Remark 3.1. In the case of ARMA processes, the purely deterministic part $\{V_t\}$
can be disregarded so that the process is represented only by a weighted sum of
current and past innovations. Processes with this property are called purely non-
deterministic, linearly regular, or regular for short. Moreover, it can be shown
that every regular process $\{X_t\}$ can be approximated arbitrarily well by an ARMA
process $\{X_t^{(ARMA)}\}$, meaning that

$$\sup_{t \in \mathbb{Z}} \mathbb{E}\left(X_t - X_t^{(ARMA)}\right)^2$$

can be made arbitrarily small. The proof of these results can be found in Hannan
and Deistler (1988, Chapter 1).

Remark 3.2. The process $\{Z_t\}$ is white noise, but not necessarily Gaussian. In
particular, $\{Z_t\}$ need not be independently and identically distributed (IID). Thus,
$\mathbb{E}(Z_{t+1} \mid X_t, X_{t-1}, \ldots)$ need not be equal to zero although $\widetilde{\mathbb{P}}_t Z_{t+1} = 0$. The reason is
that $\widetilde{\mathbb{P}}_t Z_{t+1}$ is only the best linear forecast function, whereas $\mathbb{E}(Z_{t+1} \mid X_t, X_{t-1}, \ldots)$ is
the best forecast function among all linear and non-linear functions. Examples of
processes which are white noise, but not IID, are GARCH processes discussed in
Chap. 8.

Remark 3.3. The innovations $\{Z_t\}$ may not correspond to the "true" shocks of
the underlying economic system. In this case, the shocks to the economic system
cannot be recovered from the Wold Decomposition. Thus, they are not fundamental
with respect to $\{X_t\}$. Suppose, as a simple example, that $\{X_t\}$ is generated by a
noninvertible MA(1) process:

$$X_t = U_t + \theta U_{t-1}, \quad U_t \sim WN(0, \sigma^2) \text{ and } |\theta| > 1.$$

This generates an impulse response function with respect to the true shocks
of the system equal to $(1, \theta, 0, \ldots)$. The above mechanism can, however, not be
the Wold Decomposition because the noninvertibility implies that $U_t$ cannot be
recovered from the observation of $\{X_t\}$. As shown in the introduction, there is an
observationally equivalent MA(1) process, i.e. a process which generates the same
ACF. Based on the computation in Sect. 1.5, this MA(1) process is

$$X_t = Z_t + \tilde{\theta} Z_{t-1}, \quad Z_t \sim WN(0, \tilde{\sigma}^2),$$

with $\tilde{\theta} = \theta^{-1}$ and $\tilde{\sigma}^2 = \frac{1 + \theta^2}{1 + \theta^{-2}} \sigma^2 = \theta^2 \sigma^2$. This is already the Wold Decomposition. The
impulse response function for this process is $(1, \theta^{-1}, 0, \ldots)$ which is different from
the original system. As $|\tilde{\theta}| = |\theta^{-1}| < 1$, the innovations $\{Z_t\}$ can be recovered
from the observations as $Z_t = \sum_{j=0}^{\infty} (-\tilde{\theta})^j X_{t-j}$, but they do not correspond to the
shocks of the system $\{U_t\}$. Hansen and Sargent (1991), Quah (1990), and Lippi and
Reichlin (1993) among others provide a deeper discussion and present additional
more interesting economic examples.
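
The observational equivalence in Remark 3.3 can be verified by comparing autocovariances. The sketch below uses the MA(1) autocovariances $\gamma(0) = (1 + \theta^2)\sigma^2$ and $\gamma(1) = \theta \sigma^2$; the function name `ma1_autocov` and the values $\theta = 2$, $\sigma^2 = 1$ are assumptions chosen only for the illustration.

```python
def ma1_autocov(theta, sigma2):
    """Return (gamma(0), gamma(1)) of an MA(1) process; gamma(j) = 0 for j > 1."""
    return (1 + theta**2) * sigma2, theta * sigma2

# Noninvertible "structural" process: theta = 2, sigma^2 = 1
theta, sigma2 = 2.0, 1.0
# Invertible (Wold) counterpart: theta_tilde = 1/theta, sigma_tilde^2 = theta^2 sigma^2
theta_w, sigma2_w = 1.0 / theta, theta**2 * sigma2

structural = ma1_autocov(theta, sigma2)
wold = ma1_autocov(theta_w, sigma2_w)
print(structural, wold)   # both processes share the same autocovariance function
```

Since second moments are identical, the two representations cannot be distinguished from the autocovariance function alone.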


3.3  Exponential  Smoothing 

Besides the method of least-squares forecasting, exponential smoothing can often be
seen as a valid alternative. This method views $X_t$ as a function of time:

$$X_t = f(t; \beta) + \varepsilon_t,$$

whereby $f(t; \beta)$ typically represents a polynomial in $t$ with coefficients $\beta$. The above
equation is similar to a regression model with error term $\varepsilon_t$. This error term is usually
specified as a white noise process $\{\varepsilon_t\} \sim WN(0, \sigma^2)$.

Consider first the simplest case where $X_t$ just moves randomly around a fixed
mean $\beta$. This corresponds to the case where $f(t; \beta)$ is a polynomial of degree zero:

$$X_t = \beta + \varepsilon_t.$$


If $\beta$ is known then $\mathbb{P}_T X_{T+h}$, the forecast of $X_{T+h}$ given the observations $X_T, \ldots, X_1$,
clearly is $\beta$. If, however, $\beta$ is unknown, we can substitute $\beta$ by $\overline{X}_T$, the average of
the observations:

$$\widehat{\mathbb{P}}_T X_{T+h} = \hat{\beta} = \overline{X}_T = \frac{1}{T} \sum_{t=1}^{T} X_t,$$

where the hat means that the model parameter $\beta$ has been replaced by its estimate. The
one-period ahead forecast function can then be rewritten as follows:

$$\begin{aligned} \widehat{\mathbb{P}}_T X_{T+1} &= \frac{T-1}{T}\, \widehat{\mathbb{P}}_{T-1} X_T + \frac{1}{T} X_T \\ &= \widehat{\mathbb{P}}_{T-1} X_T + \frac{1}{T} \left(X_T - \widehat{\mathbb{P}}_{T-1} X_T\right). \end{aligned}$$

The first equation represents the forecast for $T + 1$ as a linear combination of the
forecast for $T$ and of the last additional information, i.e. the last observation. The
weight given to the last observation is equal to $1/T$ because we assumed that the
mean remains constant and because the contribution of one observation to the mean
is $1/T$. The second equation represents the forecast for $T + 1$ as the forecast for $T$
plus a correction term which is proportional to the last forecast error. One advantage
of this second representation is that the computation of the new forecast, i.e. the
forecast for $T + 1$, only depends on the forecast for $T$ and the additional observation.
In this way the storage requirements are minimized.

In many applications, the mean does not remain constant, but is a slowly moving
function of time. In this case it is no longer meaningful to give each observation
the same weight. Instead, it seems plausible to weigh the more recent observations
higher than the older ones. A simple idea is to let the weights decline exponentially
which leads to the following forecast function:

$$\mathbb{P}_T X_{T+1} = \frac{1 - \omega}{1 - \omega^T} \sum_{t=0}^{T-1} \omega^t X_{T-t} \quad \text{with } |\omega| < 1.$$

$\omega$ thereby acts like a discount factor which controls the rate at which agents forget
information. $1 - \omega$ is often called the smoothing parameter. The value of $\omega$ should
depend on the speed at which the mean changes. When the mean changes
only slowly, $\omega$ should be large so that all observations are almost equally weighted;
when the mean changes rapidly, $\omega$ should be small so that only the most
recent observations are taken into account. The normalizing constant $\frac{1 - \omega}{1 - \omega^T}$ ensures
that the weights sum up to one. For large $T$ the term $\omega^T$ can be disregarded so
that one obtains the following forecasting function based on simple exponential
smoothing:

$$\begin{aligned} \mathbb{P}_T X_{T+1} &= (1 - \omega) \left[X_T + \omega X_{T-1} + \omega^2 X_{T-2} + \ldots\right] \\ &= (1 - \omega) X_T + \omega\, \mathbb{P}_{T-1} X_T \\ &= \mathbb{P}_{T-1} X_T + (1 - \omega) \left(X_T - \mathbb{P}_{T-1} X_T\right). \end{aligned}$$

In the economics literature this forecasting method is called adaptive expectation.
Similar to the model with constant mean, the new forecast is a weighted average
between the old forecast and the last (newest) observation, respectively between the
previous forecast and a term proportional to the last forecast error.

One important advantage of adaptive forecasting methods is that they can be
computed recursively. Starting with value $S_0$, the following values can be computed
as follows:

$$\begin{aligned} \mathbb{P}_0 X_1 &= S_0 \\ \mathbb{P}_1 X_2 &= \omega\, \mathbb{P}_0 X_1 + (1 - \omega) X_1 \\ \mathbb{P}_2 X_3 &= \omega\, \mathbb{P}_1 X_2 + (1 - \omega) X_2 \\ &\;\;\vdots \\ \mathbb{P}_T X_{T+1} &= \omega\, \mathbb{P}_{T-1} X_T + (1 - \omega) X_T. \end{aligned}$$

Thereby $S_0$ has to be determined. Because

$$\mathbb{P}_T X_{T+1} = (1 - \omega) \left[X_T + \omega X_{T-1} + \ldots + \omega^{T-1} X_1\right] + \omega^T S_0,$$

the effect of the starting value declines exponentially with time. In practice, we can
take $S_0 = X_1$ or $S_0 = \overline{X}_T$. The discount factor $\omega$ is usually set a priori to be a number
between 0.7 and 0.95. It is, however, possible to determine $\omega$ optimally by choosing
a value which minimizes the mean squared one-period forecast error:

$$\sum_{t=1}^{T} \left(X_t - \mathbb{P}_{t-1} X_t\right)^2 \longrightarrow \min_{|\omega| < 1}.$$


From  a  theoretical  perspective  one  can  ask  the  question  for  which  class  of  models 
exponential  smoothing  represents  the  optimal  procedure.  Muth  (1960)  showed  that 
this  class  of  models  is  given  by 


$$\Delta X_t = X_t - X_{t-1} = Z_t - \omega Z_{t-1}.$$

Note  that  the  process  generated  by  the  above  equation  is  no  longer  stationary.  This 
has  to  be  expected  as  the  exponential  smoothing  assumes  a  non-constant  mean. 
Despite  the  fact  that  this  class  seems  rather  restrictive  at  first,  practice  has  shown 
that  it  delivers  reasonable  forecasts,  especially  in  situations  when  it  becomes  costly 
to  specify  a  particular  model.7  Additional  results  and  more  general  exponential 
smoothing  methods  can  be  found  in  Abraham  and  Ledolter  (1983)  and  Mertens 
and  Rassler  (2005). 
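
The recursion of simple exponential smoothing and the data-based choice of the discount factor can be sketched as follows. This is a minimal illustration under stated assumptions: the function names `exp_smooth_forecasts`, `sse` and `best_omega` are hypothetical, the starting value is passed in by the user, and the minimization uses a coarse grid over $[0, 1)$ instead of a proper numerical optimizer.

```python
def exp_smooth_forecasts(x, omega, s0):
    """Return [P_0 X_1, P_1 X_2, ..., P_T X_{T+1}] from the smoothing recursion."""
    f = [s0]                                  # P_0 X_1 = S_0
    for obs in x:
        # P_t X_{t+1} = omega * P_{t-1} X_t + (1 - omega) * X_t
        f.append(omega * f[-1] + (1 - omega) * obs)
    return f

def sse(x, omega, s0):
    """Sum of squared one-period forecast errors X_t - P_{t-1} X_t, t = 1..T."""
    f = exp_smooth_forecasts(x, omega, s0)
    return sum((obs - pred) ** 2 for obs, pred in zip(x, f))

def best_omega(x, s0, grid=None):
    """Grid search for the discount factor (assumed grid: 0.00, 0.01, ..., 0.99)."""
    grid = grid or [i / 100 for i in range(100)]
    return min(grid, key=lambda w: sse(x, w, s0))

x = [1.0, 2.0, 3.0]
f = exp_smooth_forecasts(x, omega=0.5, s0=1.0)
# The last forecast agrees with the closed form
# (1 - w)[X_3 + w X_2 + w^2 X_1] + w^3 S_0
closed_form = 0.5 * (3.0 + 0.5 * 2.0 + 0.25 * 1.0) + 0.5**3 * 1.0
print(f, closed_form)
```

Only the previous forecast and the newest observation are needed at each step, which is the storage advantage noted in the text.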


3.4  Exercises 

Exercise  3.4.1.  Compute  the  linear  least-squares  predictor  P jXt+h,  T  >  2,  and 
the  mean  squared  error  Vr(h),  h  =  1,  2,  3,  if{X,}  is  given  by  the  AR(2)  process 

X,  =  \3X,-\  -  0.4Z,_2  +  Z,  with  Z,  ~  WN(0,  2). 

To  which  values  do  PtXt+Ii  and  Vj(h)  converge  for  h  going  to  infinity? 


7  This  happens,  for  example,  when  many,  perhaps  thousands  of  time  series  have  to  be  forecasted  in 
a  real  time  situation. 
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Exercise  3.4.2.  Compute  the  linear  least-squares  predictor  Prpfr+p  ond  the  mean 
squared  error  Vj(\),T  =  0,  1, 2,  3,  if{X,}  is  given  by  the  MA(1)  process 

X,  =  Z,  +  0.8Zr_i  with  Z,  ~  WN(0, 2). 

To  which  values  do  f>TXr+h  and  Vj(h )  converge  for  h  going  to  infinity? 

Exercise 3.4.3. Suppose that you observe $\{X_t\}$ for the two periods $t = 1$ and $t = 3$,
but not for $t = 2$.

(i) Compute the linear least-squares forecast for $X_2$ if

\[ X_t = \phi X_{t-1} + Z_t \quad \text{with } |\phi| < 1 \text{ and } Z_t \sim \mathrm{WN}(0, 4). \]

Compute the mean squared error for this forecast.

(ii) Assume now that $\{X_t\}$ is the MA(1) process

\[ X_t = Z_t + \theta Z_{t-1} \quad \text{with } Z_t \sim \mathrm{WN}(0, 4). \]

Compute the mean squared error for the forecast of $X_2$.

Exercise  3.4.4.  Let 


\[ X_t = A \cos(\omega t) + B \sin(\omega t) \]

with $A$ and $B$ being two uncorrelated random variables with mean zero and finite
variance. Show that $\{X_t\}$ satisfies the deterministic difference equation:

\[ X_t = (2 \cos\omega) X_{t-1} - X_{t-2}. \]


3.5  The  Partial  Autocorrelation  Function 

Consider again the problem of forecasting $X_{T+1}$ from observations $X_T, X_{T-1},
\ldots, X_2, X_1$. Denoting, as before, the best linear predictor by
$\mathbb{P}_T X_{T+1} = a_1 X_T + a_2 X_{T-1} + \ldots + a_{T-1} X_2 + a_T X_1$, we can express $X_{T+1}$ as

\[ X_{T+1} = \mathbb{P}_T X_{T+1} + Z_{T+1} = a_1 X_T + a_2 X_{T-1} + \ldots + a_{T-1} X_2 + a_T X_1 + Z_{T+1} \]

where $Z_{T+1}$ denotes the forecast error which is uncorrelated with $X_T, \ldots, X_1$. We
can now ask the question whether $X_1$ contributes to the forecast of $X_{T+1}$ controlling
for $X_T, X_{T-1}, \ldots, X_2$ or, equivalently, whether $a_T$ is equal to zero. Thus, $a_T$ can be
viewed as a measure of the importance of the additional information provided by
$X_1$. It is referred to as the partial autocorrelation. In the case of an AR(p) process,
the whole information useful for forecasting $X_{T+1}$, $T > p$, is incorporated in the
last $p$ observations so that $a_T = 0$. In the case of the MA process, the observations
on $X_T, \ldots, X_1$ can be used to retrieve the unobserved $Z_T, Z_{T-1}, \ldots, Z_{T-q+1}$. As $Z_t$
is an infinite weighted sum of past $X_t$'s, every new observation contributes to the
recovering of the $Z_t$'s. Thus, the partial autocorrelation $a_T$ is not zero. Taking $T$
successively equal to 0, 1, 2, etc. we get the partial autocorrelation function (PACF).

We can, however, interpret the above equation as a regression equation. From
the Frisch-Lovell-Waugh Theorem (see Davidson and MacKinnon 1993), we can
obtain $a_T$ by a two-stage procedure. Project (regress) in a first stage $X_{T+1}$ on
$X_T, \ldots, X_2$ and take the residual. Similarly, project (regress) $X_1$ on $X_T, \ldots, X_2$ and
take the residual. The coefficient $a_T$ is then obtained by projecting (regressing)
the first residual on the second. Stationarity implies that this is nothing but the
correlation coefficient between the two residuals.


3.5.1  Definition 

The  above  intuition  suggests  two  equivalent  definitions  of  the  partial  autocorrelation 
function  (PACF). 

Definition 3.2 (Partial Autocorrelation Function I). The partial autocorrelation
function (PACF) $\alpha(h)$, $h = 0, 1, 2, \ldots$, of a stationary process is defined as follows:

\[ \alpha(0) = 1, \qquad \alpha(h) = a_h, \quad h = 1, 2, \ldots, \]

where $a_h$ denotes the last element of the vector $\alpha_h = \Gamma_h^{-1}\gamma_h(1) = R_h^{-1}\rho_h(1)$ (see
Sect. 3.1 and Eq. (3.7)).

Definition 3.3 (Partial Autocorrelation Function II). The partial autocorrelation
function (PACF) $\alpha(h)$, $h = 0, 1, 2, \ldots$, of a stationary process is defined as follows:

\[ \alpha(0) = 1, \]
\[ \alpha(1) = \operatorname{corr}(X_2, X_1) = \rho(1), \]
\[ \alpha(h) = \operatorname{corr}\left[X_{h+1} - \mathbb{P}(X_{h+1} \mid 1, X_2, \ldots, X_h),\; X_1 - \mathbb{P}(X_1 \mid 1, X_2, \ldots, X_h)\right], \]

where $\mathbb{P}(X_{h+1} \mid 1, X_2, \ldots, X_h)$ and $\mathbb{P}(X_1 \mid 1, X_2, \ldots, X_h)$ denote the best linear
forecasts, in the sense of mean squared forecast errors, of $X_{h+1}$, respectively $X_1$, given
$\{1, X_2, \ldots, X_h\}$.

Remark 3.4. If $\{X_t\}$ has a mean of zero, then the constant in the projection operator
can be omitted.

The first definition implies that the partial autocorrelations are determined
from the coefficients of the forecasting function which are themselves functions
of the autocorrelation coefficients. It is therefore possible to express the partial
autocorrelations as a function of the autocorrelations. More specifically, the partial
autocorrelation function can be computed recursively from the autocorrelation
function according to the Durbin-Levinson algorithm (Durbin 1960):

\[ \alpha(0) = 1, \]
\[ \alpha(1) = a_{11} = \rho(1), \]
\[ \alpha(h) = a_{hh} = \frac{\rho(h) - \sum_{j=1}^{h-1} a_{h-1,j}\,\rho(h-j)}{1 - \sum_{j=1}^{h-1} a_{h-1,j}\,\rho(j)}, \]

where $a_{h,j} = a_{h-1,j} - a_{hh}\,a_{h-1,h-j}$ for $j = 1, 2, \ldots, h-1$.
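The recursion translates almost line by line into code. The sketch below is a hedged illustration in plain Python; the function name `pacf_durbin_levinson` and the convention that `rho[h]` holds $\rho(h)$ with `rho[0] = 1` are assumptions made for this example only.

```python
def pacf_durbin_levinson(rho, max_lag):
    """Return [alpha(0), ..., alpha(max_lag)] from autocorrelations rho[0..max_lag]."""
    alpha = [1.0]
    a_prev = []  # holds a_{h-1,1}, ..., a_{h-1,h-1}; empty before the first step
    for h in range(1, max_lag + 1):
        # numerator and denominator of the formula for a_{hh}
        num = rho[h] - sum(a_prev[j - 1] * rho[h - j] for j in range(1, h))
        den = 1.0 - sum(a_prev[j - 1] * rho[j] for j in range(1, h))
        a_hh = num / den
        # update a_{h,j} = a_{h-1,j} - a_{hh} * a_{h-1,h-j} for j = 1, ..., h-1
        a_prev = [a_prev[j - 1] - a_hh * a_prev[h - j - 1] for j in range(1, h)] + [a_hh]
        alpha.append(a_hh)
    return alpha
```

For the autocorrelations of an AR(1) process, `rho = [0.5 ** h for h in range(6)]`, the call `pacf_durbin_levinson(rho, 5)` gives $\alpha(1) = 0.5$ and, up to rounding, zero for all higher orders.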

Autoregressive  Processes 

The idea of the PACF can be well illustrated in the case of an AR(1) process

\[ X_t = \phi X_{t-1} + Z_t \quad \text{with } 0 < |\phi| < 1 \text{ and } Z_t \sim \mathrm{WN}(0, \sigma^2). \]

As shown in Chap. 2, $X_t$ and $X_{t-2}$ are correlated with each other despite the fact
that there is no direct relationship between the two. The correlation is obtained
“indirectly” because $X_t$ is correlated with $X_{t-1}$ which is itself correlated with $X_{t-2}$.
Because both correlations are equal to $\phi$, the correlation between $X_t$ and $X_{t-2}$ is equal
to $\rho(2) = \phi^2$. The ACF therefore accounts for all correlations, including the indirect
ones. The partial autocorrelation on the other hand only accounts for the direct
relationships. In the case of the AR(1) process, there is only an indirect relation
between $X_t$ and $X_{t-h}$ for $h \ge 2$, thus the PACF is zero.

Based on the results in Sect. 3.1 for the AR(1) process, Definition 3.2 of the
PACF implies:

\[ \alpha_1 = \phi \;\Longrightarrow\; \alpha(1) = \rho(1) = \phi, \]
\[ \alpha_2 = (\phi, 0)' \;\Longrightarrow\; \alpha(2) = 0, \]
\[ \alpha_3 = (\phi, 0, 0)' \;\Longrightarrow\; \alpha(3) = 0. \]

The partial autocorrelation function of an AR(1) process is therefore equal to zero
for $h \ge 2$.

This logic can be easily generalized. The PACF of a causal AR(p) process is
equal to zero for $h > p$, i.e. $\alpha(h) = 0$ for $h > p$. This property characterizes an
AR(p) process as shown in the next section.
Moving-Average Processes

Consider now the case of an invertible MA process. For this process we have:

\[ Z_t = \sum_{j=0}^{\infty} \pi_j X_{t-j} \;\Longrightarrow\; X_t = -\sum_{j=1}^{\infty} \pi_j X_{t-j} + Z_t. \]

$X_t$ is therefore “directly” correlated with each $X_{t-h}$, $h = 1, 2, \ldots$. Consequently,
the PACF is never exactly equal to zero, but converges exponentially to zero. This
convergence can be monotonic or oscillating.

Take the MA(1) process as an illustration:

\[ X_t = Z_t + \theta Z_{t-1} \quad \text{with } |\theta| < 1 \text{ and } Z_t \sim \mathrm{WN}(0, \sigma^2). \]

The computations in Sect. 3.1.2 showed that

\[ \alpha_1 = \frac{\theta}{1+\theta^2} \;\Longrightarrow\; \alpha(1) = \rho(1) = \frac{\theta}{1+\theta^2}, \]
\[ \alpha_2 = \left(\frac{\theta(1+\theta^2)}{1+\theta^2+\theta^4},\; \frac{-\theta^2}{1+\theta^2+\theta^4}\right)' \;\Longrightarrow\; \alpha(2) = \frac{-\theta^2}{1+\theta^2+\theta^4}. \]

Thus we get for the MA(1) process:

\[ \alpha(h) = -\frac{(-\theta)^h}{1+\theta^2+\ldots+\theta^{2h}} = -\frac{(-\theta)^h (1-\theta^2)}{1-\theta^{2(h+1)}}. \]


3.5.2  Interpretation  of  ACF  and  PACF 

The ACF and the PACF are two important tools for determining the nature of the
underlying mechanism of a stochastic process. In particular, they can be used to
determine the orders of the underlying AR, respectively MA processes. The analysis
of ACF and PACF to identify appropriate models is known as the Box-Jenkins
methodology (Box and Jenkins 1976). Table 3.2 summarizes the properties of both
tools for the case of a causal AR and an invertible MA process.

If $\{X_t\}$ is a causal and invertible ARMA(p,q) process, we have the following
properties. As shown in Sect. 2.4, the ACF is characterized for $h > \max\{p, q+1\}$ by
the homogeneous difference equation $\rho(h) = \phi_1\rho(h-1) + \ldots + \phi_p\rho(h-p)$. Causality
implies that the roots of the characteristic equation are all inside the unit circle. The
autocorrelation coefficients therefore decline exponentially to zero. Whether this
convergence is monotonic or oscillating depends on the signs of the roots. The PACF
starts to decline to zero for $h > p$. Thereby the coefficients of the PACF exhibit the
same behavior as the autocorrelation coefficients of $\Theta^{-1}(L)X_t$.
Table 3.2 Properties of the ACF and the PACF

Processes   ACF                                       PACF
AR(p)       Declines exponentially (monotonically     $\alpha(h) = 0$ for $h > p$
            or oscillating) to zero
MA(q)       $\rho(h) = 0$ for $h > q$                 Declines exponentially (monotonically
                                                      or oscillating) to zero

3.6  Exercises 

Exercise 3.6.1. Assign the ACF and the PACF from Fig. 3.1 to the following
processes:

\[ X_t = Z_t, \]
\[ X_t = 0.9 X_{t-1} + Z_t, \]
\[ X_t = Z_t + 0.8 Z_{t-1}, \]
\[ X_t = 0.9 X_{t-1} + Z_t + 0.8 Z_{t-1} \]

with $Z_t \sim \mathrm{WN}(0, \sigma^2)$.
Fig. 3.1 Autocorrelation and partial autocorrelation functions. (a) Process 1. (b) Process 2. (c)
Process 3. (d) Process 4


4  Estimation of the Mean and the Autocorrelation Function


In the previous chapters we have seen in which way the mean $\mu$ and, more
importantly, the autocovariance function, $\gamma(h)$, $h = 0, \pm 1, \pm 2, \ldots$, of a stationary
stochastic process $\{X_t\}$ characterize its dynamic properties, at least if we restrict
ourselves to the first two moments. In particular, we have investigated how the
autocovariance function is related to the coefficients of the corresponding ARMA
process. Thus the estimation of the ACF is not only interesting for its own sake,
but also for the specification and identification of appropriate ARMA models. It is
therefore of utmost importance to have reliable (consistent) estimators for these
entities. Moreover, we want to test specific features for a given time series. This
means that we have to develop the corresponding testing theory. As the small sample
distributions are hard to get, we rely for this purpose on asymptotic theory.1

In this section we will assume that the process is stationary and observed for
the time periods $t = 1, 2, \ldots, T$. We will refer to $T$ as the sample size. As
mentioned previously, the standard sampling theory is not appropriate in the time
series context because the $X_t$'s are not independent draws from some underlying
distribution, but are systematically related to each other.


4.1  Estimation  of  the  Mean 

The arithmetic average constitutes a “natural” estimator of the mean $\mu$ of the
stochastic process. The arithmetic mean $\bar{X}_T$ is defined as usual by

\[ \bar{X}_T = \frac{1}{T}\left(X_1 + X_2 + \ldots + X_T\right). \]

1 Recently, bootstrap methods have also been introduced in the time series context.

It is immediately clear that the arithmetic average is an unbiased estimator of the
mean:

\[ \mathbb{E}\bar{X}_T = \frac{1}{T}\left(\mathbb{E}X_1 + \mathbb{E}X_2 + \ldots + \mathbb{E}X_T\right) = \mu. \]

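Of greater interest are the asymptotic properties of the variance of the arithmetic
mean $\mathbb{V}\bar{X}_T$ which are summarized in the following theorem:

Theorem 4.1 (Convergence of Arithmetic Average). If $\{X_t\}$ is a stationary stochastic
process with mean $\mu$ and autocovariance function $\gamma(h)$ then the variance of the
arithmetic mean $\mathbb{V}\bar{X}_T$ has the following asymptotic properties:

\[ \mathbb{V}\bar{X}_T = \mathbb{E}\left(\bar{X}_T - \mu\right)^2 \to 0, \quad \text{if } \gamma(T) \to 0; \]
\[ T\,\mathbb{V}\bar{X}_T = T\,\mathbb{E}\left(\bar{X}_T - \mu\right)^2 \to \sum_{h=-\infty}^{\infty} \gamma(h), \quad \text{if } \sum_{h=-\infty}^{\infty} |\gamma(h)| < \infty, \]

for $T$ going to infinity.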

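Proof. Immediate algebra establishes:

\[ 0 \le T\,\mathbb{V}\bar{X}_T = \frac{1}{T}\sum_{i,j=1}^{T} \operatorname{cov}(X_i, X_j) = \sum_{|h|<T}\left(1 - \frac{|h|}{T}\right)\gamma(h) \le \sum_{|h|<T} |\gamma(h)| \le 2\sum_{h=1}^{T} |\gamma(h)| + \gamma(0). \]

The assumption $\gamma(h) \to 0$ for $h \to \infty$ implies that for any given $\varepsilon > 0$, we can find
$T_0$ such that $|\gamma(h)| < \varepsilon/2$ for $h \ge T_0$. If $T > T_0$ and $T > 2T_0\gamma(0)/\varepsilon$ then

\[ 0 \le \frac{1}{T}\sum_{h=1}^{T} |\gamma(h)| = \frac{1}{T}\sum_{h=1}^{T_0-1} |\gamma(h)| + \frac{1}{T}\sum_{h=T_0}^{T} |\gamma(h)| \le \frac{T_0\gamma(0)}{T} + \frac{1}{T}(T - T_0 + 1)\frac{\varepsilon}{2} \le \frac{T_0\gamma(0)}{T} + \frac{\varepsilon}{2} \le \frac{T_0\gamma(0)\,\varepsilon}{2T_0\gamma(0)} + \frac{\varepsilon}{2} = \varepsilon. \]

Therefore $\mathbb{V}\bar{X}_T$ converges to zero for $T \to \infty$ which establishes the first property.
Moreover, we have

\[ \lim_{T\to\infty} T\,\mathbb{V}\bar{X}_T = \lim_{T\to\infty} \sum_{|h|<T}\left(1 - \frac{|h|}{T}\right)\gamma(h) = \sum_{h=-\infty}^{\infty} \gamma(h) < \infty. \]

The infinite sum $\sum_{h=-\infty}^{\infty} \gamma(h)$ converges because it converges absolutely by
assumption. □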
This theorem establishes that the arithmetic average is not only an unbiased
estimator of the mean, but also a consistent one. In particular, the arithmetic average
converges in the mean-square sense, and therefore also in probability, to the true
mean (see Appendix C). This result can be interpreted as a reflection of the concept
of ergodicity (see Sect. 1.2). The assumptions are relatively mild and are fulfilled
for the ARMA processes because for these processes $\gamma(h)$ converges exponentially
fast to zero (see Sect. 2.4.2, in particular Eq. (2.6)). Under slightly more restrictive
assumptions it is even possible to show that the arithmetic mean is asymptotically
normally distributed.
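The second property of Theorem 4.1 can be checked numerically with the finite-sample expression $T\,\mathbb{V}\bar{X}_T = \sum_{|h|<T}(1 - |h|/T)\gamma(h)$ from the proof. The sketch below is an illustration only: it assumes an AR(1) process with the standard autocovariance function $\gamma(h) = \sigma^2\phi^{|h|}/(1-\phi^2)$, for which the limit $\sum_h \gamma(h)$ equals $\sigma^2/(1-\phi)^2$. The function names are hypothetical.

```python
def ar1_autocov(h, phi, sigma2=1.0):
    """Autocovariance of a causal AR(1) process (assumed example process)."""
    return sigma2 * phi ** abs(h) / (1.0 - phi ** 2)


def scaled_var_of_mean(T, gamma):
    """T * Var(sample mean) = sum over |h| < T of (1 - |h|/T) * gamma(h)."""
    total = gamma(0)
    for h in range(1, T):
        total += 2.0 * (1.0 - h / T) * gamma(h)  # gamma(h) = gamma(-h)
    return total


def limit_ar1(phi, sigma2=1.0):
    """Sum of all autocovariances of the AR(1) process."""
    return sigma2 / (1.0 - phi) ** 2
```

With $\phi = 0.5$ and $\sigma^2 = 1$, `scaled_var_of_mean(T, lambda h: ar1_autocov(h, 0.5))` starts at $\gamma(0) = 4/3$ for $T = 1$ and approaches the limit 4 as $T$ grows.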

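Theorem 4.2 (Asymptotic Distribution of Sample Mean). For any stationary
process $\{X_t\}$ given by

\[ X_t = \mu + \sum_{j=-\infty}^{\infty} \psi_j Z_{t-j}, \qquad Z_t \sim \mathrm{IID}(0, \sigma^2), \]

such that $\sum_{j=-\infty}^{\infty} |\psi_j| < \infty$ and $\sum_{j=-\infty}^{\infty} \psi_j \ne 0$, the arithmetic average $\bar{X}_T$ is
asymptotically normal:

\[ \sqrt{T}\left(\bar{X}_T - \mu\right) \xrightarrow{\;d\;} \mathrm{N}\left(0, \sum_{h=-\infty}^{\infty} \gamma(h)\right) = \mathrm{N}\left(0, \sigma^2\Big(\sum_{j=-\infty}^{\infty} \psi_j\Big)^2\right) = \mathrm{N}\left(0, \sigma^2\Psi(1)^2\right) \]

where $\gamma$ is the autocovariance function of $\{X_t\}$.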


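Proof. The standard proof invokes the Basic Approximation Theorem C.14 and the
Central Limit Theorem for m-dependent processes C.13. To this end we define the
2m-dependent approximate process

\[ X_t^{(m)} = \mu + \sum_{j=-m}^{m} \psi_j Z_{t-j}. \]

For $X_t^{(m)}$ we have $V_m = \sum_{h=-2m}^{2m} \gamma(h) = \sigma^2\big(\sum_{j=-m}^{m} \psi_j\big)^2$. This last assertion can
be verified by noting that

\[ V = \sum_{h=-\infty}^{\infty} \gamma(h) = \sigma^2 \sum_{h=-\infty}^{\infty}\sum_{j=-\infty}^{\infty} \psi_j\psi_{j+h} = \sigma^2 \sum_{j=-\infty}^{\infty} \psi_j \sum_{h=-\infty}^{\infty} \psi_{j+h} = \sigma^2\Big(\sum_{j=-\infty}^{\infty} \psi_j\Big)^2. \]

Note that the assumption $\sum_{j=-\infty}^{\infty} |\psi_j| < \infty$ guarantees the convergence of the
infinite sums. Applying this result to the special case $\psi_j = 0$ for $|j| > m$, we
obtain $V_m$.

The arithmetic average of the approximating process is

\[ \bar{X}_T^{(m)} = \frac{1}{T}\sum_{t=1}^{T} X_t^{(m)}. \]

The CLT for m-dependent processes C.13 then implies that for $T \to \infty$

\[ \sqrt{T}\left(\bar{X}_T^{(m)} - \mu\right) \xrightarrow{\;d\;} X^{(m)} = \mathrm{N}(0, V_m). \]

As $m \to \infty$, $\sigma^2\big(\sum_{j=-m}^{m} \psi_j\big)^2$ converges to $\sigma^2\big(\sum_{j=-\infty}^{\infty} \psi_j\big)^2$ and thus

\[ X^{(m)} \xrightarrow{\;d\;} X = \mathrm{N}(0, V) = \mathrm{N}\left(0, \sigma^2\Big(\sum_{j=-\infty}^{\infty} \psi_j\Big)^2\right). \]

This assertion can be established by noting that the characteristic function of
$X^{(m)}$ approaches the characteristic function of $X$ so that by Theorem C.11 $X^{(m)} \xrightarrow{\;d\;} X$.

Finally, we show that the approximation error becomes negligible as $T$ goes to
infinity:

\[ \sqrt{T}\left(\bar{X}_T - \mu\right) - \sqrt{T}\left(\bar{X}_T^{(m)} - \mu\right) = T^{-1/2}\sum_{t=1}^{T}\left(X_t - X_t^{(m)}\right) = T^{-1/2}\sum_{t=1}^{T} e_t^{(m)} \]

where the error $e_t^{(m)}$ is

\[ e_t^{(m)} = \sum_{|j|>m} \psi_j Z_{t-j}. \]

Clearly, $\{e_t^{(m)}\}$ is a stationary process with autocovariance function $\gamma_e$ such that
$\sum_{h=-\infty}^{\infty} \gamma_e(h) = \sigma^2\big(\sum_{|j|>m} \psi_j\big)^2 < \infty$. We can therefore invoke Theorem 4.1
to show that

\[ \mathbb{V}\left(\sqrt{T}\left(\bar{X}_T - \mu\right) - \sqrt{T}\left(\bar{X}_T^{(m)} - \mu\right)\right) = T\,\mathbb{V}\left(\frac{1}{T}\sum_{t=1}^{T} e_t^{(m)}\right) \]

converges to $\sigma^2\big(\sum_{|j|>m} \psi_j\big)^2$ as $T \to \infty$. This term converges to zero as $m \to \infty$.
The approximation error $\sqrt{T}(\bar{X}_T - \mu) - \sqrt{T}(\bar{X}_T^{(m)} - \mu)$ therefore converges in
mean square to zero and thus, using Chebyschev’s inequality (see Theorem C.3
or C.7), also in probability. We have therefore established the third condition of
Theorem C.14 as well. Thus, we can conclude that $\sqrt{T}(\bar{X}_T - \mu) \xrightarrow{\;d\;} X$. □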

Under  a  more  restrictive  summability  condition  which  holds,  however,  within 
the  context  of  causal  ARMA  processes,  we  can  provide  a  less  technical  proof.  This 
proof  follows  an  idea  of  Phillips  and  Solo  (1992)  and  is  based  on  the  Beveridge- 
Nelson  decomposition  (see  Appendix  D).2 

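Theorem 4.3. For any stationary process

\[ X_t = \mu + \sum_{j=0}^{\infty} \psi_j Z_{t-j} \]

with the properties $Z_t \sim \mathrm{IID}(0, \sigma^2)$ and $\sum_{j=0}^{\infty} j^2|\psi_j|^2 < \infty$, the arithmetic average
$\bar{X}_T$ is asymptotically normal:

\[ \sqrt{T}\left(\bar{X}_T - \mu\right) \xrightarrow{\;d\;} \mathrm{N}\left(0, \sum_{h=-\infty}^{\infty} \gamma(h)\right) = \mathrm{N}\left(0, \sigma^2\Big(\sum_{j=0}^{\infty} \psi_j\Big)^2\right) = \mathrm{N}\left(0, \sigma^2\Psi(1)^2\right). \]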


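Proof. The application of the Beveridge-Nelson decomposition (see Theorem D.1
in Appendix D) leads to

\[ \bar{X}_T - \mu = \frac{1}{T}\sum_{t=1}^{T} \Psi(L)Z_t = \frac{1}{T}\sum_{t=1}^{T}\left(\Psi(1) + (L-1)\tilde{\Psi}(L)\right)Z_t = \Psi(1)\,\frac{1}{T}\sum_{t=1}^{T} Z_t + \frac{1}{T}\tilde{\Psi}(L)\left(Z_0 - Z_T\right), \]

\[ \sqrt{T}\left(\bar{X}_T - \mu\right) = \Psi(1)\,\frac{\sum_{t=1}^{T} Z_t}{\sqrt{T}} + \frac{1}{\sqrt{T}}\tilde{\Psi}(L)Z_0 - \frac{1}{\sqrt{T}}\tilde{\Psi}(L)Z_T. \]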


2 The Beveridge-Nelson decomposition is an indispensable tool for the understanding of integrated
and cointegrated processes analyzed in Chaps. 7 and 16.
The assumption $Z_t \sim \mathrm{IID}(0, \sigma^2)$ allows us to apply the Central Limit Theorem C.12
of Appendix C to the first term. Thus, $\sum_{t=1}^{T} Z_t/\sqrt{T}$ is asymptotically normal with mean
zero and variance $\sigma^2$. Theorem D.1 also implies $|\Psi(1)| < \infty$. Therefore, the term
$\Psi(1)\sum_{t=1}^{T} Z_t/\sqrt{T}$ is asymptotically normal with mean zero and variance $\sigma^2\Psi(1)^2$.
The variances of the second and third term are equal to $\frac{\sigma^2}{T}\sum_{j=0}^{\infty} \tilde{\psi}_j^2$. The
summability condition then implies according to Theorem D.1 that $\sum_{j=0}^{\infty} \tilde{\psi}_j^2$
is finite. Thus, the variances of the last two terms converge to zero for $T \to \infty$,
implying that these terms converge also to zero in probability (see Theorem C.7)
and thus also in distribution. We can then invoke Theorem C.10 to establish the
theorem. Finally, the equality of $\sum_{h=-\infty}^{\infty} \gamma(h)$ and $\sigma^2\Psi(1)^2$ can be obtained from
direct computations or by the application of Theorem 6.4. □

Remark 4.1. Theorem 4.2 holds with respect to any causal ARMA process because
the $\psi_j$'s converge exponentially fast to zero (see the discussion following Eq. (2.5)).

Remark 4.2. If $\{X_t\}$ is a Gaussian process, then for any given fixed $T$, $\bar{X}_T$ is
distributed as

\[ \sqrt{T}\left(\bar{X}_T - \mu\right) \sim \mathrm{N}\left(0, \sum_{|h|<T}\left(1 - \frac{|h|}{T}\right)\gamma(h)\right). \]

According to Theorem 4.2, the asymptotic variance of the average depends on
the sum of all covariances $\gamma(h)$. This entity, denoted by $J$, is called the long-run
variance of $\{X_t\}$:

\[ J = \sum_{h=-\infty}^{\infty} \gamma(h) = \gamma(0)\left(1 + 2\sum_{h=1}^{\infty} \rho(h)\right). \qquad (4.1) \]

Note that the long-run variance equals $2\pi$ times the spectral density $f(\lambda)$ evaluated
at $\lambda = 0$ (see the Definition 6.1 of the spectral density in Sect. 6.1).

As the long-run variance takes into account the serial properties of the time
series, it is also called heteroskedasticity and autocorrelation consistent variance (HAC
variance). If $\{X_t\}$ has some nontrivial autocorrelation (i.e. $\rho(h) \ne 0$ for some $h \ne 0$), the
long-run variance $J$ is in general different from $\gamma(0)$. This implies among other things that the
construction of the t-statistic for testing the simple hypothesis $H_0\colon \mu = \mu_0$ should
be based on $J$ rather than on $\gamma(0)$.

In case that $\{X_t\}$ is a causal ARMA process with $\Phi(L)X_t = \Theta(L)Z_t$, $Z_t \sim
\mathrm{WN}(0, \sigma^2)$, the long-run variance is given by

\[ J = \left(\frac{\Theta(1)}{\Phi(1)}\right)^2 \sigma^2 = \Psi(1)^2\sigma^2. \]
If $\{X_t\}$ is an AR(1) process with $X_t = \phi X_{t-1} + Z_t$, $Z_t \sim \mathrm{WN}(0, \sigma^2)$ and $|\phi| < 1$,
$\gamma(0) = \frac{\sigma^2}{1-\phi^2}$ and $\rho(h) = \phi^{|h|}$. Thus the long-run variance is given by
$J = \frac{\sigma^2}{(1-\phi)^2} = \gamma(0) \times \frac{1+\phi}{1-\phi}$. From this example it is clear that the long-run variance can be smaller
or larger than $\gamma(0)$, depending on the sign of $\phi$: for negative values of $\phi$, $\gamma(0)$
overestimates the long-run variance; for positive values, it underestimates $J$. The
estimation of the long-run variance is dealt with in Sect. 4.4.
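The formula $J = (\Theta(1)/\Phi(1))^2\sigma^2$ is easy to evaluate once the ARMA coefficients are known. The sketch below is a small illustration with hypothetical function names. It assumes the sign convention $\Phi(L) = 1 - \phi_1 L - \ldots - \phi_p L^p$ and $\Theta(L) = 1 + \theta_1 L + \ldots + \theta_q L^q$, which matches the AR(1) and MA(1) examples used in this chapter.

```python
def long_run_variance_arma(ar, ma, sigma2=1.0):
    """J = sigma2 * (Theta(1) / Phi(1))**2 for a causal ARMA process.

    ar = [phi_1, ..., phi_p], ma = [theta_1, ..., theta_q] (assumed sign convention).
    """
    phi_at_one = 1.0 - sum(ar)      # Phi(1)
    theta_at_one = 1.0 + sum(ma)    # Theta(1)
    return sigma2 * (theta_at_one / phi_at_one) ** 2


def ar1_variance(phi, sigma2=1.0):
    """gamma(0) of a causal AR(1) process."""
    return sigma2 / (1.0 - phi ** 2)
```

For $\phi = 0.5$ and $\sigma^2 = 1$ we obtain $J = 4$ while $\gamma(0) = 4/3$, so that $J/\gamma(0) = (1+\phi)/(1-\phi) = 3$. For $\phi = -0.5$ the ordering is reversed: $J = 4/9$ is smaller than $\gamma(0) = 4/3$.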


4.2  Estimation  of  the  Autocovariance  and  the  Autocorrelation 
Function 


With some slight, asymptotically unimportant modifications, we can use the standard
estimators for the autocovariances, $\gamma(h)$, and the autocorrelations, $\rho(h)$, of a
stationary stochastic process:

\[ \hat{\gamma}(h) = \frac{1}{T}\sum_{t=1}^{T-h}\left(X_t - \bar{X}_T\right)\left(X_{t+h} - \bar{X}_T\right), \qquad (4.2) \]

\[ \hat{\rho}(h) = \frac{\hat{\gamma}(h)}{\hat{\gamma}(0)}. \qquad (4.3) \]

These estimators are biased because the sums are normalized (divided) by $T$ rather
than $T - h$. The normalization with $T - h$ delivers an unbiased estimate only if
$\bar{X}_T$ is replaced by $\mu$ which, however, is typically unknown in practice. The second
modification concerns the use of the complete sample for the estimation of $\mu$.3 The
main advantage of using the above estimators is that the implied estimator for the
covariance matrix, $\hat{\Gamma}_T$, respectively the autocorrelation matrix, $\hat{R}_T$, of $(X_1, \ldots, X_T)'$,

\[ \hat{\Gamma}_T = \begin{pmatrix} \hat{\gamma}(0) & \hat{\gamma}(1) & \cdots & \hat{\gamma}(T-1) \\ \hat{\gamma}(1) & \hat{\gamma}(0) & \cdots & \hat{\gamma}(T-2) \\ \vdots & \vdots & \ddots & \vdots \\ \hat{\gamma}(T-1) & \hat{\gamma}(T-2) & \cdots & \hat{\gamma}(0) \end{pmatrix}, \qquad \hat{R}_T = \frac{\hat{\Gamma}_T}{\hat{\gamma}(0)}, \]

always delivers, independently of the realized observations, non-negative definite
and for $\hat{\gamma}(0) > 0$ non-singular matrices. The resulting estimated autocovariance
function will then satisfy the characterization given in Theorem 1.1, in particular
property (iv).
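The estimators (4.2) and (4.3) can be written down directly. The following plain-Python sketch uses hypothetical function names and is meant only to make the normalization by $T$ and the use of the full-sample mean explicit.

```python
def sample_autocov(x, h):
    """Estimator (4.2): divide by T (not T - h) and use the full-sample mean."""
    T = len(x)
    mean = sum(x) / T
    h = abs(h)
    return sum((x[t] - mean) * (x[t + h] - mean) for t in range(T - h)) / T


def sample_autocorr(x, h):
    """Estimator (4.3): ratio of the estimated autocovariances."""
    return sample_autocov(x, h) / sample_autocov(x, 0)
```

For the short series `x = [1, 2, 3, 4, 5]` the deviations from the mean are $-2, -1, 0, 1, 2$, so that $\hat{\gamma}(0) = 10/5 = 2$, $\hat{\gamma}(1) = 4/5 = 0.8$ and $\hat{\rho}(1) = 0.4$.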


3 The standard statistical formulas would suggest to estimate the mean appearing in the first
multiplicand from $X_1, \ldots, X_{T-h}$ and the mean appearing in the second multiplicand from $X_{h+1}, \ldots, X_T$.
According to Box and Jenkins (1976, p. 33), one can expect reasonable estimates
for $\gamma(h)$ and $\rho(h)$ if the sample size is larger than 50 and if the order of the
autocorrelation coefficient is smaller than $T/4$.

The theorem below establishes that these estimators lead under rather mild
conditions to consistent and asymptotically normally distributed estimators.

Theorem 4.4 (Asymptotic Distribution of Autocorrelations). Let $\{X_t\}$ be the
stationary process

\[ X_t = \mu + \sum_{j=-\infty}^{\infty} \psi_j Z_{t-j} \]

with $Z_t \sim \mathrm{IID}(0, \sigma^2)$, $\sum_{j=-\infty}^{\infty} |\psi_j| < \infty$ and $\sum_{j=-\infty}^{\infty} |j|\psi_j^2 < \infty$. Then we have for
$h = 1, 2, \ldots$

\[ \begin{pmatrix} \hat{\rho}(1) \\ \vdots \\ \hat{\rho}(h) \end{pmatrix} \xrightarrow{\;d\;} \mathrm{N}\left(\begin{pmatrix} \rho(1) \\ \vdots \\ \rho(h) \end{pmatrix}, \frac{W}{T}\right) \]

where the elements of $W = (w_{ij})_{i,j \in \{1,\ldots,h\}}$ are given by Bartlett's formula

\[ w_{ij} = \sum_{k=1}^{\infty} \left[\rho(k+i) + \rho(k-i) - 2\rho(i)\rho(k)\right]\left[\rho(k+j) + \rho(k-j) - 2\rho(j)\rho(k)\right]. \]

Proof. Brockwell and Davis (1991, Section 7.3). □

Brockwell and Davis (1991) offer a second version of the above theorem where
$\sum_{j=-\infty}^{\infty} |j|\psi_j^2 < \infty$ is replaced by the assumption of finite fourth moments, i.e.
by assuming $\mathbb{E}Z_t^4 < \infty$. As we rely mainly on ARMA processes, we do not
pursue this distinction further because this class of processes automatically fulfills
the above assumptions as soon as $\{Z_t\}$ is identically and independently distributed
(IID). A proof which relies on the Beveridge-Nelson polynomial decomposition (see
Theorem D.1 in Appendix D) can be gathered from Phillips and Solo (1992).


Example: $\{X_t\} \sim \mathrm{IID}(0, \sigma^2)$

The most important application of Theorem 4.4 is related to the case of a white noise
process. For this process $\rho(h)$ is equal to zero for $|h| > 0$. Theorem 4.4 then implies
that

\[ w_{ij} = \begin{cases} 1, & \text{for } i = j; \\ 0, & \text{otherwise.} \end{cases} \]

Fig. 4.1 Estimated autocorrelation function of a WN(0,1) process with 95 % confidence interval
for sample size T = 100

The estimated autocorrelation coefficients converge to the true autocorrelation
coefficient, in this case zero. The distribution of $\sqrt{T}\hat{\rho}(h)$ converges to
the standard normal distribution. This implies that for large $T$ we can approximate
the distribution of $\hat{\rho}(h)$ by a normal distribution with mean zero and variance $1/T$.
This allows the construction of a 95 % confidence interval assuming that the true
process is white noise. This confidence interval is therefore given by $\pm 1.96\,T^{-1/2}$. It
can be used to verify if the observed process is indeed white noise.

Figure 4.1 plots the empirical autocorrelation function of a WN(0,1) process with
a sample size of $T = 100$. The implied 95 % confidence interval is therefore equal
to $\pm 0.196$. As each estimated autocorrelation coefficient falls within the confidence
interval, we can conclude that the observed time series may indeed represent a
white noise process.

Instead of examining each correlation coefficient separately, we can test the joint
hypothesis that all correlation coefficients up to order $N$ are simultaneously equal
to zero, i.e. $\rho(1) = \rho(2) = \ldots = \rho(N) = 0$, $N = 1, 2, \ldots$ As each $\sqrt{T}\hat{\rho}(h)$
has an asymptotic standard normal distribution and is asymptotically uncorrelated
with $\sqrt{T}\hat{\rho}(k)$, $h \ne k$, $T$ times the sum of the squared estimated autocorrelation
coefficients is asymptotically $\chi^2$ distributed with $N$ degrees of freedom. This test
statistic is called the Box-Pierce statistic:

\[ Q = T\sum_{h=1}^{N} \hat{\rho}^2(h) \sim \chi^2_N. \]

A refinement of this test statistic is given by the Ljung-Box statistic:

\[ Q' = T(T+2)\sum_{h=1}^{N} \frac{\hat{\rho}^2(h)}{T-h} \sim \chi^2_N. \qquad (4.4) \]

This test statistic is also asymptotically $\chi^2$ distributed with the same degrees of
freedom $N$. This statistic accounts for the fact that the estimates for high orders
$h$ are based on a smaller number of observations and are thus less precise and more
noisy. The two test statistics are used in the usual way. The null hypothesis that
all correlation coefficients are jointly equal to zero is rejected if $Q$, respectively $Q'$,
is larger than the critical value corresponding to the $\chi^2_N$ distribution. The number of
summands $N$ is usually taken to be rather large, for a sample size of 150 in the
range between 15 and 20. The two tests are also referred to as Portmanteau tests.
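Both statistics are simple functions of the estimated autocorrelations. The sketch below is a minimal illustration with hypothetical function names. It assumes that `rho_hat` holds $\hat{\rho}(1), \ldots, \hat{\rho}(N)$ in this order and leaves the comparison with the $\chi^2_N$ critical value to a statistical table.

```python
def box_pierce(rho_hat, T):
    """Q = T * sum of squared estimated autocorrelations of orders 1..N."""
    return T * sum(r * r for r in rho_hat)


def ljung_box(rho_hat, T):
    """Q' = T (T + 2) * sum over h of rho_hat(h)^2 / (T - h), see Eq. (4.4)."""
    return T * (T + 2) * sum(r * r / (T - h) for h, r in enumerate(rho_hat, start=1))
```

With $T = 100$, $\hat{\rho}(1) = 0.1$ and $\hat{\rho}(2) = -0.2$ we get $Q = 100 \times (0.01 + 0.04) = 5$ and a slightly larger $Q'$. Both values lie below 5.99, the 95 % quantile of the $\chi^2_2$ distribution, so the null hypothesis of white noise would not be rejected at the 5 % level.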


Example: MA(q) Process $X_t = Z_t + \theta_1 Z_{t-1} + \ldots + \theta_q Z_{t-q}$
with $Z_t \sim \mathrm{IID}(0, \sigma^2)$

In this case the covariance matrix is determined as

\[ w_{ii} = 1 + 2\rho(1)^2 + \ldots + 2\rho(q)^2 \quad \text{for } i > q. \]

For $i, j > q$, $w_{ij}$ is equal to zero. The 95 % confidence interval for the MA(1) process
$X_t = Z_t - 0.8 Z_{t-1}$ is therefore given for a sample size of $T = 200$ by
$\pm 1.96\,T^{-1/2}\left[1 + 2\rho(1)^2\right]^{1/2} = \pm 0.1684$.
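The numerical value of this confidence band can be reproduced in a few lines. The sketch below uses a hypothetical function name and takes as given that the MA(1) process has $\rho(1) = \theta/(1+\theta^2)$.

```python
import math


def ma_acf_band(rho, T, z=1.96):
    """Half-width of the confidence interval for rho_hat(i), i > q.

    rho = [rho(1), ..., rho(q)] of the assumed MA(q) process; z = 1.96 gives 95 %.
    """
    w_ii = 1.0 + 2.0 * sum(r * r for r in rho)  # Bartlett's formula for i > q
    return z * math.sqrt(w_ii / T)


theta = -0.8
rho_1 = theta / (1.0 + theta ** 2)       # first-order autocorrelation of the MA(1)
band_ma1 = ma_acf_band([rho_1], T=200)   # approximately 0.1684
band_wn = ma_acf_band([], T=100)         # white noise case: 1.96 / sqrt(100) = 0.196
```

The band under the MA(1) hypothesis is wider than the white noise band for the same sample size because the factor $1 + 2\rho(1)^2$ exceeds one.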

Figure 4.2 shows the estimated autocorrelation function of the above MA(1) process
together with 95 % confidence intervals based on a white noise process and
an MA(1) process with $\theta = -0.8$. As the first order autocorrelation coefficient is
clearly outside the confidence interval whereas all other autocorrelation coefficients
are inside it, the figure demonstrates that the observations are evidently the realization
of an MA(1) process.

Fig. 4.2 Estimated autocorrelation function of an MA(1) process with $\theta = -0.8$ with corresponding
95 % confidence interval for T = 200


Example: AR(1) Process $X_t - \phi X_{t-1} = Z_t$ with $Z_t \sim \mathrm{IID}(0, \sigma^2)$

In this case the covariance matrix is determined as

\[ w_{ii} = \sum_{k=1}^{i} \phi^{2i}\left(\phi^{k} - \phi^{-k}\right)^2 + \sum_{k=i+1}^{\infty} \phi^{2k}\left(\phi^{i} - \phi^{-i}\right)^2 = \frac{\left(1 - \phi^{2i}\right)\left(1 + \phi^2\right)}{1 - \phi^2} - 2i\phi^{2i} \approx \frac{1 + \phi^2}{1 - \phi^2} \quad \text{for large } i. \]

The formulas for $w_{ij}$ with $i \ne j$ are not shown. In any case, this formula is of
relatively little importance because the partial autocorrelations are better suited for
the identification of AR processes (see Sect. 3.5 and 4.3).

Figure 4.3 shows an estimated autocorrelation function of an AR(1) process. The
autocorrelation coefficients decline exponentially which is characteristic of an
AR(1) process.4 Furthermore the coefficients are outside the confidence interval
for white noise processes up to order 8.

Fig. 4.3 Estimated autocorrelation function of an AR(1) process with $\phi = 0.8$ and corresponding
95 % confidence interval for T = 100


4.3  Estimation  of  the  Partial  Autocorrelation  Function 

According to its definition (see Definition 3.2), the partial autocorrelation of order
$h$, $\alpha(h)$, is equal to $a_h$, the last element of the vector $\alpha_h = \Gamma_h^{-1}\gamma_h(1) = R_h^{-1}\rho_h(1)$.
Thus, $\alpha_h$ and consequently $a_h$ can be estimated by $\hat{\alpha}_h = \hat{\Gamma}_h^{-1}\hat{\gamma}_h(1) = \hat{R}_h^{-1}\hat{\rho}_h(1)$.
As $\rho(h)$ can be consistently estimated and its estimator is asymptotically normally distributed
(see Sect. 4.2), the continuous mapping theorem (see Appendix C) ensures that the
above estimator for $\alpha(h)$ is also consistent and asymptotically normal. In particular
we have for an AR(p) process (Brockwell and Davis 1991)

\[ \sqrt{T}\,\hat{\alpha}(h) \xrightarrow{\;d\;} \mathrm{N}(0, 1) \quad \text{for } T \to \infty \text{ and } h > p. \]

This result allows us to construct, as in the case of the autocorrelation coefficients,
confidence intervals for the partial autocorrelation coefficients. The 95 % confidence
interval is given by $\pm 1.96/\sqrt{T}$. The AR(p) process is characterized by the fact that
the partial autocorrelation coefficients are zero for $h > p$. $\hat{\alpha}(h)$ should therefore be
inside the confidence interval for $h > p$ and outside for $h \le p$. Figure 4.4 confirms
this for an AR(1) process with $\phi = 0.8$.
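A small simulation illustrates this behavior. The sketch below is a hedged illustration with hypothetical function names, not the procedure behind Fig. 4.4: it simulates an AR(1) process with Gaussian innovations, computes the sample autocorrelations, and obtains the partial autocorrelations from the Durbin-Levinson recursion of Sect. 3.5 instead of inverting $\hat{R}_h$ explicitly (both approaches solve the same linear system).

```python
import math
import random


def simulate_ar1(T, phi, seed=0, burn_in=200):
    """Simulate X_t = phi * X_{t-1} + Z_t with standard normal Z_t (assumption)."""
    rng = random.Random(seed)
    x, value = [], 0.0
    for t in range(T + burn_in):
        value = phi * value + rng.gauss(0.0, 1.0)
        if t >= burn_in:  # discard the start-up phase
            x.append(value)
    return x


def sample_acf(x, max_lag):
    """Estimated autocorrelations of orders 0..max_lag, normalized by T."""
    T = len(x)
    mean = sum(x) / T
    d = [v - mean for v in x]
    gamma0 = sum(v * v for v in d) / T
    return [sum(d[t] * d[t + h] for t in range(T - h)) / T / gamma0
            for h in range(max_lag + 1)]


def sample_pacf(x, max_lag):
    """Estimated partial autocorrelations via the Durbin-Levinson recursion."""
    rho = sample_acf(x, max_lag)
    alpha, a_prev = [1.0], []
    for h in range(1, max_lag + 1):
        num = rho[h] - sum(a_prev[j - 1] * rho[h - j] for j in range(1, h))
        den = 1.0 - sum(a_prev[j - 1] * rho[j] for j in range(1, h))
        a_hh = num / den
        a_prev = [a_prev[j - 1] - a_hh * a_prev[h - j - 1] for j in range(1, h)] + [a_hh]
        alpha.append(a_hh)
    return alpha
```

For `x = simulate_ar1(2000, 0.8, seed=42)` the first estimated partial autocorrelation should be close to 0.8, whereas the higher-order ones should be small relative to it and mostly within the band $\pm 1.96/\sqrt{2000} \approx \pm 0.044$. As the band is a 95 % interval, occasional violations at individual lags are to be expected.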


Fig. 4.4 Estimated PACF for an AR(1) process with $\phi = 0.8$ and corresponding 95 % confidence
interval for T = 200

4 As a reminder: the theoretical autocorrelation coefficients are $\rho(h) = \phi^{|h|}$.

Fig. 4.5 Estimated PACF for an MA(1) process with $\theta = 0.8$ and corresponding 95 % confidence
interval for T = 200

Figure 4.5 shows the estimated PACF for an MA(1) process with $\theta = 0.8$. In
conformity with the theory, the partial autocorrelation coefficients converge to zero.
They do so in an oscillating manner because $\theta$ is positive (see formula in Sect. 3.5).


4.4  Estimation  of  the  Long-Run  Variance 

For  many  applications5  it  is  necessary  to  estimate  the  long-run  variance  J  which  is 
defined  according  to  Eq.  (4.1)  as  follows6 


OO  OO  /  OO  \ 

J=  J2  Y(h)  =  y(.0)  +  2j2y(h)  =  Y(0)  (l+2£>(A)  .  (4.5) 

h=-oo  h=  1  V  h=  1  / 

This  can,  in  principle,  be  done  in  two  different  ways.  The  first  one  consists  in  the 
estimation  of  an  ARMA  model  which  is  then  used  to  derive  the  implied  covariances 
as  explained  in  Sect.  2.4.  These  covariances  are  then  inserted  into  Eq.  (4.5).  The 
second  method  is  a  nonparametric  one  and  is  the  subject  for  the  rest  of  this  Section. 
It  has  the  advantage  that  it  is  not  necessary  to  identify  and  estimate  an  appropriate 
ARMA  model,  a  step  which  can  be  cumbersome  in  practice.  Additional  and  more 


5 For example, when testing the null hypothesis $H_0: \mu = \mu_0$ in the case of serially correlated
observations (see Sect. 4.1); for the Phillips-Perron unit-root test explained in Sect. 7.3.2.
6 See Theorem 4.2 and the comments following it.
advanced material on this topic can be found in Andrews (1991), Andrews and
Monahan (1992), or among others in Haan and Levin (1997).7

If the sample size is $T > 1$, only the covariances $\gamma(0), \ldots, \gamma(T-1)$ can, in
principle, be estimated. Thus, a first naive estimator of $J$ is given by $\hat{J}_T$ defined as

$$\hat{J}_T = \sum_{h=-T+1}^{T-1} \hat{\gamma}(h) = \hat{\gamma}(0) + 2\sum_{h=1}^{T-1} \hat{\gamma}(h) = \hat{\gamma}(0)\left(1 + 2\sum_{h=1}^{T-1} \hat{\rho}(h)\right)$$

where $\hat{\gamma}(h)$ and $\hat{\rho}(h)$ are the estimators for $\gamma(h)$ and $\rho(h)$, respectively, given
in Sect. 4.2. As the estimators of the higher order autocovariances are based on
smaller samples, their estimates become more erratic. At the same time, their weight
in the above sum is the same as that of the lower order and more precisely estimated
autocovariances. Thus, the higher order autocovariances have a disproportionately
harmful influence on the above estimator.

A remedy for this problem is to use only a certain number $\ell_T$ of autocovariances
and/or to use a weighted sum instead of an unweighted one. This idea leads to the
following class of estimators:

$$\hat{J}_T = \hat{J}_T(\ell_T) = \frac{T}{T-r} \sum_{h=-T+1}^{T-1} k\left(\frac{h}{\ell_T}\right) \hat{\gamma}(h),$$

where $k$ is a weighting or kernel function.8 The kernel functions are required to have
the following properties:

(i) $k : \mathbb{R} \to [-1, 1]$ is, with the exception of a finite number of points, a continuous
function. In particular, $k$ is continuous at $x = 0$.

(ii) $k$ is quadratically integrable, i.e. $\int_{\mathbb{R}} k(x)^2\,dx < \infty$;

(iii) $k(0) = 1$;

(iv) $k$ is symmetric, i.e. $k(x) = k(-x)$ for all $x \in \mathbb{R}$.

The basic idea of the kernel function is to give relatively little weight to the higher
order autocovariances and relatively more weight to the smaller order ones. As $k(0)$
equals one, the variance $\hat{\gamma}(0)$ receives weight one by construction. The continuity
assumption implies that also the covariances of smaller order, i.e. for $h$ small, receive
a weight close to one. Table 4.1 lists some of the most popular kernel functions used
in practice.

Figure 4.6 shows a plot of these functions. The first three functions are nonzero
only for $|x| < 1$. This implies that only the orders $h$ for which $|h| < \ell_T$ are
taken into account. $\ell_T$ is called the lag truncation parameter or the bandwidth. The
quadratic spectral kernel function is an example of a kernel function which takes all


7 Note the connection between the long-run variance and the spectral density at frequency zero:
$J = 2\pi f(0)$ where $f$ is the spectral density function (see Sect. 6.3).

8 Kernel functions are also relevant for spectral estimators. See in particular Sect. 6.3.
Table 4.1 Common kernel functions

Name                    k(x) =
Boxcar ("truncated")    1
Bartlett                1 - |x|
Daniell                 sin(πx) / (πx)
Tukey-Hanning           (1 + cos(πx)) / 2
Quadratic Spectral      25/(12π²x²) · ( sin(6πx/5)/(6πx/5) - cos(6πx/5) )

The functions are, with the exception of the quadratic
spectral function, only defined for |x| ≤ 1. Outside this
interval they are set to zero


Fig. 4.6 Common kernel functions


covariances  into  account.  Note  that  some  weights  are  negative  in  this  case  as  shown 
in  Fig.  4. 6. 9 
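
The kernel functions of Table 4.1 can be written down directly. The following is a minimal sketch in Python; the function names are illustrative and not part of the text, and the truncation to $|x| \le 1$ follows the note below the table.

```python
import math

def boxcar(x):
    """Boxcar ("truncated") kernel: weight one inside [-1, 1], zero outside."""
    return 1.0 if abs(x) <= 1 else 0.0

def bartlett(x):
    """Bartlett kernel: linearly declining weights, zero for |x| >= 1."""
    return max(0.0, 1.0 - abs(x))

def daniell(x):
    """Daniell kernel sin(pi x)/(pi x), set to zero outside [-1, 1]."""
    if abs(x) > 1:
        return 0.0
    if x == 0:
        return 1.0  # limit of sin(pi x)/(pi x) at zero
    return math.sin(math.pi * x) / (math.pi * x)

def tukey_hanning(x):
    """Tukey-Hanning kernel (1 + cos(pi x))/2, zero outside [-1, 1]."""
    return (1.0 + math.cos(math.pi * x)) / 2.0 if abs(x) <= 1 else 0.0

def quadratic_spectral(x):
    """Quadratic spectral kernel; defined on the whole real line."""
    if x == 0:
        return 1.0  # k(0) = 1 by continuity
    z = 6.0 * math.pi * x / 5.0
    return 25.0 / (12.0 * math.pi ** 2 * x ** 2) * (math.sin(z) / z - math.cos(z))

# Bartlett weights k(h / l_T) for a bandwidth of l_T = 4
weights = [bartlett(h / 4) for h in range(0, 5)]
```

Evaluating `quadratic_spectral` at, say, 1.5 gives a negative weight, which illustrates the remark that some weights of this kernel are negative.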

The estimator for the long-run variance is subject to the correction term $\frac{T}{T-r}$.
This factor depends on the number of parameters estimated in a first step and is only
relevant when the sample size is relatively small. In the case of the estimation of the
mean, $r$ would be equal to one and the correction term is negligible. If, on the other
hand, $X_t$, $t = 1, \ldots, T$, are the residuals from a multivariate regression, $r$ designates
the number of regressors. In many applications the correction term is omitted.

The lag truncation parameter or bandwidth, $\ell_T$, depends on the number of
observations. It is intuitive that the number of autocovariances accounted for in


9 Phillips (2004) has proposed a nonparametric regression-based method which does not require a
kernel function.
the computation of the long-run variance should increase with the sample size,
i.e. we should have $\ell_T \to \infty$ for $T \to \infty$.10 The relevant issue is at which
rate the lag truncation parameter should go to infinity. The literature has made several
suggestions.11 In the following we concentrate on the Bartlett and the quadratic
spectral kernel because these functions always deliver a positive long-run variance
in small samples. Andrews (1991) proposes the following formulas to determine the
optimal bandwidth:

Bartlett:            $\ell_T = 1.1447\,[\alpha_{\text{Bartlett}}\,T]^{1/3}$

Quadratic Spectral:  $\ell_T = 1.3221\,[\alpha_{\text{QuadraticSpectral}}\,T]^{1/5}$


where $[\cdot]$ rounds to the nearest integer. The two coefficients $\alpha_{\text{Bartlett}}$ and
$\alpha_{\text{QuadraticSpectral}}$ are data dependent constants which have to be determined in a
first step from the data (see Andrews (1991, 832-839), Andrews and Monahan
(1992, 958), and Haan and Levin (1997)). If the underlying process is approximated
by an AR(1) model, we get:

$$\alpha_{\text{Bartlett}} = \frac{4\hat{\rho}^2}{(1-\hat{\rho})^2(1+\hat{\rho})^2}$$

$$\alpha_{\text{QuadraticSpectral}} = \frac{4\hat{\rho}^2}{(1-\hat{\rho})^4},$$

where $\hat{\rho}$ is the first order empirical autocorrelation coefficient.

In order to avoid the cumbersome determination of the $\alpha$'s, Newey and West
(1994) suggest the following rules of thumb:

Bartlett:            $\ell_T = \beta_{\text{Bartlett}} \left[\frac{T}{100}\right]^{2/9}$

Quadratic Spectral:  $\ell_T = \beta_{\text{QuadraticSpectral}} \left[\frac{T}{100}\right]^{2/25}$


It has been shown that values of 4 for $\beta_{\text{Bartlett}}$ as well as for $\beta_{\text{QuadraticSpectral}}$ lead
to acceptable results. A comparison of these formulas with the ones provided by
Andrews shows that the latter imply larger values for $\ell_T$ when the sample size gets


10 This is true even when the underlying process is known to be an MA(q) process. Even in this
case it is advantageous to include also the autocovariances for $h > q$. The reason is twofold. First,
only when $\ell_T \to \infty$ for $T \to \infty$, do we get a consistent estimator, i.e. $\hat{J}_T \to J_T$, respectively $J$.
Second, the restriction to $\hat{\gamma}(h)$, $|h| \le q$, does not necessarily lead to a positive value for the estimated
long-run variance $\hat{J}_T$, even when the Bartlett kernel is used. See Ogaki (1992) for details.
11 See Haan and Levin (1997) for an overview.
larger. Both approaches lead to consistent estimates, i.e. $\hat{J}_T(\ell_T) - J_T \xrightarrow{p} 0$ for
$T \to \infty$.
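
The two bandwidth rules can be sketched in a few lines of Python. The function names are illustrative only. The Newey-West rule is rounded to the nearest integer, while the Andrews rule is returned unrounded so that the caller can decide how to round. The AR(1) plug-in constants are coded as in Andrews (1991), which is an assumption of this sketch.

```python
def newey_west_bandwidth(T, kernel="bartlett", beta=4.0):
    """Rule-of-thumb bandwidth beta * (T/100)^(2/9) or beta * (T/100)^(2/25)."""
    exponents = {"bartlett": 2.0 / 9.0, "quadratic_spectral": 2.0 / 25.0}
    if kernel not in exponents:
        raise ValueError("unknown kernel: " + kernel)
    return round(beta * (T / 100.0) ** exponents[kernel])

def andrews_bandwidth(T, rho, kernel="bartlett"):
    """Data-dependent bandwidth with AR(1) plug-in constants (assumes |rho| < 1)."""
    if kernel == "bartlett":
        alpha = 4.0 * rho ** 2 / ((1.0 - rho) ** 2 * (1.0 + rho) ** 2)
        return 1.1447 * (alpha * T) ** (1.0 / 3.0)
    if kernel == "quadratic_spectral":
        alpha = 4.0 * rho ** 2 / (1.0 - rho) ** 4
        return 1.3221 * (alpha * T) ** (1.0 / 5.0)
    raise ValueError("unknown kernel: " + kernel)

# With T = 97 observations the rule of thumb gives a Bartlett bandwidth of 4.
bandwidth = newey_west_bandwidth(97)
```

Because the Andrews bandwidth grows like $T^{1/3}$ (Bartlett), whereas the rule of thumb grows like $T^{2/9}$, the former eventually exceeds the latter as the sample size increases.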

In practice, a combination of both parametric and nonparametric methods proved
to deliver the best results. This combined method consists of five steps:

(i) The first step is called prewhitening and consists in the estimation of a
simple ARMA model for the process $\{X_t\}$ to remove the most obvious serial
correlations. The idea, which goes back to Press and Tukey (1956) (see also
Priestley (1981)), is to get a process for the residuals $\hat{Z}_t$ which is close to a
white noise process. Usually, an AR(1) model is sufficient.12

(ii) Choose a kernel function and, if the method of Andrews has been chosen, the
corresponding data dependent constants, i.e. $\alpha_{\text{Bartlett}}$ or $\alpha_{\text{QuadraticSpectral}}$ for the
Bartlett, respectively the quadratic spectral kernel function.

(iii) Compute the lag truncation parameter for the residuals using the above
formulas.

(iv) Estimate the long-run variance for the residuals $\hat{Z}_t$.

(v) Compute the long-run variance for the original time series $\{X_t\}$.

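If in the first step an AR(1) model, $X_t = \phi X_{t-1} + Z_t$, was used, the last step is
given by:

$$\hat{J}_T^X(\ell_T) = \frac{\hat{J}_T^Z(\ell_T)}{(1-\hat{\phi})^2},$$

where $\hat{J}_T^X(\ell_T)$ and $\hat{J}_T^Z(\ell_T)$ denote the estimated long-run variances of $\{X_t\}$ and $\{\hat{Z}_t\}$.
In the general case of an arbitrary ARMA model, $\Phi(L)X_t = \Theta(L)Z_t$, we get:

$$\hat{J}_T^X(\ell_T) = \left(\frac{\hat{\Theta}(1)}{\hat{\Phi}(1)}\right)^2 \hat{J}_T^Z(\ell_T).$$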

4.4.1  An  Example 

Suppose we want to test whether the yearly growth rate of Switzerland's real GDP in
the last 25 years was higher than 1 %. For this purpose we compute the percentage
change against the corresponding quarter of the last year over the period 1982:1
to 2006:1 (97 observations in total), i.e. we compute $X_t = (1 - L^4)\log(\text{GDP}_t)$.
The arithmetic average of these growth rates is 1.4960 with a variance of 3.0608.


12 If in this step an AR(1) model is used and if a first order correlation $\hat{\phi}$ larger in absolute terms than
0.97 is obtained, Andrews and Monahan (1992, 457) suggest to replace $\hat{\phi}$ by $-0.97$, respectively
0.97. Instead of using an arbitrary fixed value, it turns out that a data driven value is superior. Sul
et al. (2005) suggest to replace 0.97 by $1 - 1/\sqrt{T}$ and $-0.97$ by $-1 + 1/\sqrt{T}$.
Fig. 4.7 Estimated autocorrelation function for Switzerland's real GDP growth (percentage
change against corresponding last year's quarter)


We test the null hypothesis that the growth rate is smaller than one against the
alternative that it is greater than one. The corresponding value of the t-statistic is
$(1.4960 - 1)/\sqrt{3.0608/97} = 2.7922$. Taking a 5 % significance level, the critical
value for this one-sided test is 1.661. Thus the null hypothesis is clearly rejected.

The  above  computation  is,  however,  not  valid  because  the  serial  correlation  of 
the  time  series  was  not  taken  into  account.  Indeed  the  estimated  autocorrelation 
function  shown  in  Fig.  4.7  clearly  shows  that  the  growth  rate  is  subject  to  high  and 
statistically  significant  autocorrelations. 

Taking the Bartlett function as the kernel function, the rule of thumb formula for
the lag truncation parameter suggests $\ell_T = 4$. The weights in the computation of the
long-run variance are therefore

$$k(h/\ell_T) = \begin{cases} 1, & h = 0; \\ 3/4, & h = \pm 1; \\ 2/4, & h = \pm 2; \\ 1/4, & h = \pm 3; \\ 0, & |h| \ge 4. \end{cases}$$


The corresponding estimate for the long-run variance is therefore given by:

$$\hat{J}_T = 3.0608 \left(1 + 2\cdot\frac{3}{4}\cdot 0.8287 + 2\cdot\frac{2}{4}\cdot 0.6019 + 2\cdot\frac{1}{4}\cdot 0.3727\right) = 9.2783.$$

Using the long-run variance instead of the simple variance leads to a quite different
value of the t-statistic: $(1.4960 - 1)/\sqrt{9.2783/97} = 1.6037$. The null hypothesis
is thus not rejected at the 5 % significance level when the serial correlation of the
process is taken into account.
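
The arithmetic of this example can be retraced with a short script. It only uses the numbers quoted in the text (mean, variance, sample size and the first three estimated autocorrelations); the variable names are illustrative.

```python
import math

mean, variance, T = 1.4960, 3.0608, 97
rho_hat = [0.8287, 0.6019, 0.3727]      # estimated autocorrelations, orders 1 to 3
bandwidth = 4                           # rule-of-thumb value for the Bartlett kernel

# Bartlett weights k(h / bandwidth) for h = 1, 2, 3
weights = [1.0 - h / bandwidth for h in range(1, bandwidth)]

# Long-run variance: gamma(0) * (1 + 2 * sum of weighted autocorrelations)
J_hat = variance * (1.0 + 2.0 * sum(w * r for w, r in zip(weights, rho_hat)))

t_naive = (mean - 1.0) / math.sqrt(variance / T)   # ignores serial correlation
t_robust = (mean - 1.0) / math.sqrt(J_hat / T)     # uses the long-run variance

print(round(J_hat, 3), round(t_naive, 3), round(t_robust, 3))
```

Up to rounding of the inputs, the script reproduces the long-run variance of about 9.278 and the two t-statistics of about 2.792 and 1.604.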


4.5  Exercises 

Exercise 4.5.1. You regress 100 realizations of a stationary stochastic process $\{X_t\}$
against a constant $c$. The least-squares estimate of $c$ equals $\hat{c} = 0.4$ with an
estimated standard deviation of $\hat{\sigma}_c = 0.15$. In addition, you have estimated the
autocorrelation function up to order $h = 5$ and obtained the following values:

$\hat{\rho}(1) = -0.43$, $\hat{\rho}(2) = 0.13$, $\hat{\rho}(3) = -0.12$, $\hat{\rho}(4) = 0.18$, $\hat{\rho}(5) = -0.23$.

(i) How do you interpret the estimated parameter value of 0.4?

(ii) Examine the autocorrelation function. Do you think that $\{X_t\}$ is white noise?

(iii) Why is the estimated standard deviation $\hat{\sigma}_c = 0.15$ incorrect?

(iv) Estimate the long-run variance using the Bartlett kernel.

(v) Test the null hypothesis that $\{X_t\}$ is a mean-zero process.


Estimation  of  ARMA  Models 


The  specification  and  estimation  of  an  ARMA(p,q)  model  for  a  given  realization 
involves  several  intermingled  steps.  First  one  must  determine  the  orders  p  and  q. 
Given the orders one can then estimate the parameters $\phi_j$, $\theta_j$ and $\sigma^2$. Finally, the
model  has  to  pass  several  robustness  checks  in  order  to  be  accepted  as  a  valid  model. 
These  checks  may  involve  tests  of  parameter  constancy,  forecasting  performance 
or  tests  for  the  inclusion  of  additional  exogenous  variables.  This  is  usually  an 
iterative  process  in  which  several  models  are  examined.  It  is  rarely  the  case  that  one 
model  imposes  itself.  All  too  often,  one  is  confronted  in  the  modeling  process  with 
several  trade-offs,  like  simple  versus  complex  models  or  data  fit  versus  forecasting 
performance.  Finding  the  right  balance  among  the  different  dimensions  therefore 
requires  some  judgement  based  on  experience. 

We start the discussion by assuming that the orders of the ARMA process
are known and that the problem just consists in the estimation of the corresponding
parameters from a realization of length T. For simplicity, we assume that the data
are mean adjusted. We will introduce three estimation methods. The first one is a
method of moments procedure where the theoretical moments are equated to the
empirical ones. This procedure is known under the name of Yule-Walker estimator.
The second procedure interprets the stochastic difference equation as a regression model and
estimates the parameters by ordinary least-squares (OLS). These two methods work
well if the underlying model is just an AR model and thus involves no MA terms.
If the model comprises MA terms, a maximum likelihood (ML) approach must be
pursued.


5.1  The  Yule-Walker  Estimator 

We  assume  that  the  stochastic  process  has  mean  zero  and  is  governed  by  a  causal 
purely  autoregressive  model  of  order  p: 

$$\Phi(L)X_t = Z_t \quad \text{with } Z_t \sim \mathrm{WN}(0, \sigma^2)$$

where $\Phi(L) = 1 - \phi_1 L - \phi_2 L^2 - \ldots - \phi_p L^p$. Causality with respect to $\{Z_t\}$ implies that
there exists a sequence $\{\psi_j\}$ with $\sum_{j=0}^{\infty} |\psi_j| < \infty$ such that $X_t = \sum_{j=0}^{\infty} \psi_j Z_{t-j} = \Psi(L)Z_t$.
Multiplying the above difference equation by $X_{t-j}$, $j = 0, 1, \ldots, p$, and
taking expectations leads to the following equation system for the parameters
$\Phi = (\phi_1, \ldots, \phi_p)'$ and $\sigma^2$:

$$\begin{aligned}
\gamma(0) - \phi_1\gamma(1) - \ldots - \phi_p\gamma(p) &= \sigma^2 \\
\gamma(1) - \phi_1\gamma(0) - \ldots - \phi_p\gamma(p-1) &= 0 \\
&\;\;\vdots \\
\gamma(p) - \phi_1\gamma(p-1) - \ldots - \phi_p\gamma(0) &= 0
\end{aligned}$$

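This equation system is known as the Yule-Walker equations. It can be written
compactly in matrix algebra as:

$$\gamma(0) - \Phi'\gamma_p(1) = \sigma^2,$$

$$\begin{pmatrix} \gamma(0) & \gamma(1) & \ldots & \gamma(p-1) \\ \gamma(1) & \gamma(0) & \ldots & \gamma(p-2) \\ \vdots & \vdots & \ddots & \vdots \\ \gamma(p-1) & \gamma(p-2) & \ldots & \gamma(0) \end{pmatrix} \begin{pmatrix} \phi_1 \\ \phi_2 \\ \vdots \\ \phi_p \end{pmatrix} = \begin{pmatrix} \gamma(1) \\ \gamma(2) \\ \vdots \\ \gamma(p) \end{pmatrix}$$

respectively

$$\gamma(0) - \Phi'\gamma_p(1) = \sigma^2,$$
$$\Gamma_p \Phi = \gamma_p(1).$$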

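The Yule-Walker estimator is obtained by replacing the theoretical moments
by the empirical ones and solving the resulting equation system for the unknown
parameters:

$$\hat{\Phi} = \hat{\Gamma}_p^{-1}\hat{\gamma}_p(1) = \hat{R}_p^{-1}\hat{\rho}_p(1)$$

$$\hat{\sigma}^2 = \hat{\gamma}(0) - \hat{\Phi}'\hat{\gamma}_p(1) = \hat{\gamma}(0)\left(1 - \hat{\rho}_p(1)'\hat{R}_p^{-1}\hat{\rho}_p(1)\right)$$

Note the recursiveness of the equation system: the estimate $\hat{\Phi}$ is obtained without
knowledge of $\hat{\sigma}^2$ as the estimator $\hat{R}_p^{-1}\hat{\rho}_p(1)$ involves only autocorrelations. The
estimates $\hat{\Gamma}_p$, $\hat{R}_p$, $\hat{\gamma}_p(1)$, $\hat{\rho}_p(1)$, and $\hat{\gamma}(0)$ are obtained in the usual way as explained
in Chap. 4.1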


1 Note that the application of the estimator introduced in Sect. 4.2 guarantees that $\hat{\Gamma}_p$ is always
invertible.
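
A minimal sketch of the Yule-Walker estimator in Python is given below. Instead of inverting $\hat{\Gamma}_p$ explicitly, it solves the Yule-Walker equations by the Levinson-Durbin recursion, which yields the same solution; this substitution and the function names are choices of the sketch, not of the text.

```python
def sample_autocovariances(x, max_lag):
    """Sample autocovariances of orders 0..max_lag (divisor T, data demeaned)."""
    T = len(x)
    m = sum(x) / T
    return [sum((x[t] - m) * (x[t - h] - m) for t in range(h, T)) / T
            for h in range(max_lag + 1)]

def yule_walker(x, p):
    """Return (phi_hat, sigma2_hat) solving the empirical Yule-Walker equations."""
    g = sample_autocovariances(x, p)
    phi = []          # coefficients of the AR(k-1) fit, updated order by order
    sigma2 = g[0]     # innovation variance of the current fit
    for k in range(1, p + 1):
        # last coefficient of the AR(k) fit (partial autocorrelation at lag k)
        a = (g[k] - sum(phi[j] * g[k - 1 - j] for j in range(k - 1))) / sigma2
        phi = [phi[j] - a * phi[k - 2 - j] for j in range(k - 1)] + [a]
        sigma2 *= 1.0 - a * a
    return phi, sigma2

# Small illustration with p = 1: phi_hat equals the first autocorrelation.
phi_hat, sigma2_hat = yule_walker([1.0, 2.0, 3.0, 4.0, 5.0], 1)
```

For this toy series the sample autocovariances are $\hat{\gamma}(0) = 2$ and $\hat{\gamma}(1) = 0.8$, so the estimate is $\hat{\phi} = 0.4$ with $\hat{\sigma}^2 = 2(1 - 0.4^2) = 1.68$.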
The construction of the Yule-Walker estimator implies that the first p values
of the autocovariance, respectively the autocorrelation function, implied by the
estimated model exactly correspond to their estimated counterparts. It can be shown
that this moment estimator always delivers coefficients $\hat{\Phi}$ which imply that $\{X_t\}$ is
causal with respect to $\{Z_t\}$. In addition, the following Theorem establishes that the
estimated coefficients are asymptotically normal.

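Theorem 5.1 (Asymptotic Normality of Yule-Walker Estimator). Let $\{X_t\}$ be an
AR(p) process which is causal with respect to $\{Z_t\}$ whereby $\{Z_t\} \sim \mathrm{IID}(0, \sigma^2)$.
Then the Yule-Walker estimator is consistent and $\hat{\Phi}$ is asymptotically normal with
distribution given by:

$$\sqrt{T}\left(\hat{\Phi} - \Phi\right) \xrightarrow{d} N\left(0, \sigma^2\Gamma_p^{-1}\right).$$

In addition we have that

$$\hat{\sigma}^2 \xrightarrow{p} \sigma^2.$$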

Proof.  See  Brockwell  and  Davis  (1991,  233-234).  □ 

It is noteworthy that the asymptotic covariance matrix of the Yule-Walker estimate is
independent of $\sigma^2$. In practice, the unknown parameters in $\sigma^2\Gamma_p^{-1}$ are replaced by their
empirical counterparts.


Example: AR(1) Process

In the case of an AR(1) process, the Yule-Walker equation is $\Gamma_1\Phi = \gamma_1(1)$ which
simplifies to $\gamma(0)\phi = \gamma(1)$. The Yule-Walker estimator thus becomes:

$$\hat{\Phi} = \hat{\phi} = \frac{\hat{\gamma}(1)}{\hat{\gamma}(0)} = \hat{\rho}(1).$$

The asymptotic distribution then is

$$\sqrt{T}\left(\hat{\phi} - \phi\right) \xrightarrow{d} N\left(0, \frac{\sigma^2}{\gamma(0)}\right) = N\left(0, 1 - \phi^2\right).$$

This shows that the assumption of causality, i.e. $|\phi| < 1$, is crucial. Otherwise
no strictly positive value for the variance would exist. For the case $\phi = 1$,
which corresponds to the random walk, the asymptotic distribution of $\sqrt{T}(\hat{\phi} - 1)$
becomes degenerate as the variance is equal to zero. This case is, however, of prime
importance in economics and is treated in detail in Chap. 7.

In practice the order of the model is usually unknown. However, one can expect
when estimating an AR(m) model whereby the true order p is strictly smaller than m
that the estimated coefficients $\hat{\phi}_{p+1}, \ldots, \hat{\phi}_m$ should be close to zero. This is indeed
the case as shown in Brockwell and Davis (1991, 241). In particular, under the
assumptions of Theorem 5.1 it holds that

$$\sqrt{T}\,\hat{\phi}_m \xrightarrow{d} N(0, 1) \quad \text{for } m > p. \tag{5.1}$$

This result justifies the following strategy to identify the order of an AR-model.
Estimate in a first step a highly parameterized model (overfitted model), i.e. a model
with a large value of $m$, and test via a t-test whether $\phi_m$ is zero. If the hypothesis
cannot be rejected, reduce the order of the model from $m$ to $m - 1$ and repeat the
same procedure now with respect to $\phi_{m-1}$. This is done until the hypothesis is
rejected.

If the order of the initial model is too low (underfitted model) so that the true
order is higher than $m$, one incurs an "omitted variable bias". The corresponding
estimates are no longer consistent. In Sect. 5.4, we take a closer look at the problem
of determining the order of a model.


Example:  MA(q)  Process 

The Yule-Walker estimator can, in principle, also be applied to MA(q) or
ARMA(p,q) processes with $q > 0$. However, the analysis of the simple MA(1)
process in Sect. 1.5.1 showed that the relation between the autocorrelations and the
model parameters is nonlinear and may have two, one, or no solution. Consider
again the MA(1) process as an example. It is given by the stochastic difference
equation $X_t = Z_t + \theta Z_{t-1}$ with $Z_t \sim \mathrm{IID}(0, \sigma^2)$. The Yule-Walker equations are
then as follows:

$$\hat{\gamma}(0) = \hat{\sigma}^2(1 + \hat{\theta}^2)$$
$$\hat{\gamma}(1) = \hat{\sigma}^2\hat{\theta}$$

As shown in Sect. 1.5.1, this system of equations has for the case $|\hat{\rho}(1)| =
|\hat{\gamma}(1)/\hat{\gamma}(0)| < 1/2$ two solutions; for $|\hat{\rho}(1)| = |\hat{\gamma}(1)/\hat{\gamma}(0)| = 1/2$ one solution;
and for $|\hat{\rho}(1)| = |\hat{\gamma}(1)/\hat{\gamma}(0)| > 1/2$ no real solution. In the case of several solutions,
we usually take the invertible one which leads to $|\hat{\theta}| < 1$. Invertibility is, however,
a restriction which is hard to implement in the case of higher order MA processes.
Moreover, it can be shown that the Yule-Walker estimator is no longer consistent in
general (see Brockwell and Davis (1991, 246) for details). For these reasons, it is not
advisable to use the Yule-Walker estimator in the case of MA processes, especially
when there exist consistent and efficient alternatives.
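
The case distinction for the MA(1) process follows from dividing the two moment equations, which gives $\hat{\rho}(1) = \hat{\theta}/(1 + \hat{\theta}^2)$, i.e. the quadratic equation $\hat{\rho}(1)\hat{\theta}^2 - \hat{\theta} + \hat{\rho}(1) = 0$. A small sketch in Python, with an illustrative function name:

```python
import math

def ma1_moment_solutions(rho1):
    """Real solutions theta of rho1 = theta / (1 + theta^2), smallest |theta| first."""
    if rho1 == 0:
        return [0.0]                       # degenerate case: white noise
    discriminant = 1.0 - 4.0 * rho1 ** 2
    if discriminant < 0:
        return []                          # |rho1| > 1/2: no real solution
    if discriminant == 0:
        return [1.0 / (2.0 * rho1)]        # |rho1| = 1/2: theta = +1 or -1
    root = math.sqrt(discriminant)
    # The two roots are reciprocals of each other; the first is invertible.
    return sorted([(1.0 - root) / (2.0 * rho1), (1.0 + root) / (2.0 * rho1)],
                  key=abs)

solutions = ma1_moment_solutions(0.4)      # approximately [0.5, 2.0]
```

For $\hat{\rho}(1) = 0.4$ the two solutions are approximately 0.5 and 2; the invertible choice is the first one.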


5.2 Ordinary Least-Squares (OLS) Estimation of an AR(p) Model

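An alternative approach is to view the AR model as a regression model for $X_t$ with
regressors $X_{t-1}, \ldots, X_{t-p}$ and error term $Z_t$:

$$X_t = \phi_1 X_{t-1} + \ldots + \phi_p X_{t-p} + Z_t, \qquad Z_t \sim \mathrm{WN}(0, \sigma^2).$$

Given observations for $X_1, \ldots, X_T$, the regression model can be compactly written in
matrix algebra as follows:

$$\begin{pmatrix} X_{p+1} \\ X_{p+2} \\ \vdots \\ X_T \end{pmatrix} = \begin{pmatrix} X_p & X_{p-1} & \ldots & X_1 \\ X_{p+1} & X_p & \ldots & X_2 \\ \vdots & \vdots & \ddots & \vdots \\ X_{T-1} & X_{T-2} & \ldots & X_{T-p} \end{pmatrix} \begin{pmatrix} \phi_1 \\ \phi_2 \\ \vdots \\ \phi_p \end{pmatrix} + \begin{pmatrix} Z_{p+1} \\ Z_{p+2} \\ \vdots \\ Z_T \end{pmatrix},$$

$$Y = X\Phi + Z. \tag{5.2}$$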

Note that the first p observations are lost and that the effective sample size is thus
reduced to $T - p$. The least-squares estimator (OLS estimator) is obtained as the
minimizer of the sum of squares $S(\Phi)$:

$$\begin{aligned}
S(\Phi) = Z'Z &= (Y - X\Phi)'(Y - X\Phi) \\
&= \sum_{t=p+1}^{T} \left(X_t - \phi_1 X_{t-1} - \ldots - \phi_p X_{t-p}\right)^2 \\
&= \sum_{t=p+1}^{T} \left(X_t - P_{t-1}X_t\right)^2 \longrightarrow \min_{\Phi}. \qquad (5.3)
\end{aligned}$$

Note that the optimization problem involves no constraints, in particular causality is
not imposed as a restriction. The solution of this minimization problem is given by the
usual formula:

$$\hat{\Phi} = (X'X)^{-1}(X'Y).$$
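
The OLS estimator can be sketched in plain Python as follows. The data are assumed to be mean adjusted, as in the text. The helper `solve_linear_system` is a small Gaussian elimination written for this illustration, and the residual variance is divided by the effective sample size $T - p$, which is a choice of this sketch.

```python
def solve_linear_system(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for c in range(n):
        pivot_row = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[pivot_row] = M[pivot_row], M[c]
        for r in range(c + 1, n):
            factor = M[r][c] / M[c][c]
            for k in range(c, n + 1):
                M[r][k] -= factor * M[c][k]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (M[i][n] - sum(M[i][k] * x[k] for k in range(i + 1, n))) / M[i][i]
    return x

def ols_ar(x, p):
    """OLS estimate of an AR(p) model: returns (phi_hat, residual variance)."""
    T = len(x)
    rows = [[x[t - j] for j in range(1, p + 1)] for t in range(p, T)]  # matrix X
    y = [x[t] for t in range(p, T)]                                    # vector Y
    XtX = [[sum(r[i] * r[j] for r in rows) for j in range(p)] for i in range(p)]
    XtY = [sum(r[i] * yt for r, yt in zip(rows, y)) for i in range(p)]
    phi = solve_linear_system(XtX, XtY)
    residuals = [yt - sum(f * v for f, v in zip(phi, r)) for r, yt in zip(rows, y)]
    s2 = sum(e * e for e in residuals) / len(residuals)
    return phi, s2

# A noise-free AR(2) recursion is recovered exactly (up to rounding error).
series = [1.0, 2.0]
for _ in range(28):
    series.append(0.5 * series[-1] - 0.3 * series[-2])
phi_hat, s2 = ols_ar(series, 2)
```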


Though Eq. (5.2) resembles very much an ordinary regression model, there are
some important differences. First, the standard orthogonality assumption between
regressors and error is violated. The regressors $X_{t-j}$, $j = 1, \ldots, p$, are correlated
with the error terms $Z_{t-j}$, $j = 1, 2, \ldots$ Second, there is a dependency on the starting
values $X_p, \ldots, X_1$. The assumption of causality, however, ensures that these features
do not play a role asymptotically. It can be shown that $(X'X)/T$ converges in
probability to $\Gamma_p$ and $(X'Y)/T$ to $\gamma_p(1)$. In addition, under quite general conditions,
$T^{-1/2}X'Z$ is asymptotically normally distributed with mean 0 and variance $\sigma^2\Gamma_p$.
Then by Slutzky's Lemma C.10, $\sqrt{T}(\hat{\Phi} - \Phi) = \left(\frac{X'X}{T}\right)^{-1}\left(\frac{X'Z}{\sqrt{T}}\right)$ converges in
distribution to $N(0, \sigma^2\Gamma_p^{-1})$. Thus, the OLS estimator is asymptotically equivalent
to the Yule-Walker estimator.

Theorem 5.2 (Asymptotic Normality of the Least-Squares Estimator). Under the
same assumptions as in Theorem 5.1, the ordinary least-squares estimator (OLS
estimator) $\hat{\Phi} = (X'X)^{-1}(X'Y)$ is asymptotically distributed as

$$\sqrt{T}\left(\hat{\Phi} - \Phi\right) \xrightarrow{d} N\left(0, \sigma^2\Gamma_p^{-1}\right),$$

$$\operatorname{plim} s_T^2 = \sigma^2$$

where $s_T^2 = \hat{Z}'\hat{Z}/T$ and $\hat{Z}_t$ are the OLS residuals.

Proof.  See  Chap.  13  and  in  particular  Sect.  13.3  for  a  proof  in  the  multivariate  case. 
Additional  details  may  be  gathered  from  Brockwell  and  Davis  (1991,  chapter  8). 

□ 

Remark 5.1. In practice $\sigma^2\Gamma_p^{-1}$ is approximated by $s_T^2(X'X/T)^{-1}$. Thus, for large
T, $\hat{\Phi}$ can be viewed as being normally distributed as $N(\Phi, s_T^2(X'X)^{-1})$. This result
allows the application of the usual t- and F-tests.

Because the regressors $X_{t-j}$, $j = 1, \ldots, p$, are correlated with the error terms
$Z_{t-j}$, $j = 1, 2, \ldots$, the Gauss-Markov theorem cannot be applied. This implies that
the least-squares estimator is no longer unbiased in finite samples. It can be shown
that the estimates of an AR(1) model are downward biased when the true value of
$\phi$ is between zero and one. MacKinnon and Smith (1998, figure 1) plot the bias as
a function of the sample size and the true parameter (see also Fig. 7.1). As the bias
function is almost linear in the range $-0.85 < \phi < 0.85$, an approximately unbiased
estimator for the AR(1) model has been proposed by Marriott and Pope (1954),
Kendall (1954), and Orcutt and Winokur (1969) (for further details see MacKinnon
and Smith 1998):

$$\hat{\phi}_{\text{corrected}} = \frac{1}{T-3}\left(T\hat{\phi}_{\text{OLS}} + 1\right).$$
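
As a small numerical illustration of this correction (a sketch with an illustrative function name), an OLS estimate of 0.8 obtained from $T = 100$ observations is adjusted upwards to $81/97 \approx 0.835$:

```python
def bias_corrected_ar1(phi_ols, T):
    """Approximately unbiased AR(1) coefficient: (T * phi_ols + 1) / (T - 3)."""
    if T <= 3:
        raise ValueError("the correction requires T > 3")
    return (T * phi_ols + 1.0) / (T - 3.0)

phi_corrected = bias_corrected_ar1(0.8, 100)   # 81 / 97
```

The adjustment shrinks as $T$ grows, in line with the bias being a small-sample phenomenon.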


Remark 5.2. The OLS estimator does in general not deliver coefficients $\hat{\Phi}$ for which
$\{X_t\}$ is causal with respect to $\{Z_t\}$. In particular, in the case of an AR(1) model, it
can happen that, in contrast to the Yule-Walker estimator, $|\hat{\phi}|$ is larger than one
despite the fact that the true parameter is absolutely smaller than one. Nevertheless,
the least-squares estimator is to be preferred in practice because it delivers small-sample
biases of the coefficients which are smaller than those of the Yule-Walker
estimator, especially for roots of $\Phi(z)$ close to the unit circle (Tjøstheim and Paulsen
1983; Shaman and Stine 1988; Reinsel 1993).


Appendix: Proof of the Asymptotic Normality of the OLS Estimator

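The proofs of Theorems 5.1 and 5.2 are rather involved and will therefore not
be pursued here. A proof for the more general multivariate case will be given in
Chap. 13. It is, however, instructive to look at a simple case, namely the AR(1)
model with $|\phi| < 1$, $Z_t \sim \mathrm{IIN}(0, \sigma^2)$ and $X_0 = 0$. Denoting by $\hat{\phi}_T$ the OLS estimator
of $\phi$, we have:

$$\sqrt{T}\left(\hat{\phi}_T - \phi\right) = \frac{\frac{1}{\sqrt{T}}\sum_{t=1}^{T} X_{t-1}Z_t}{\frac{1}{T}\sum_{t=1}^{T} X_{t-1}^2}. \tag{5.4}$$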

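Moreover, $X_t$ can be written as follows:

$$X_t = Z_t + \phi Z_{t-1} + \ldots + \phi^{t-1}Z_1.$$

By assumption each $Z_j$, $j = 1, \ldots, t$, is normally distributed so that $X_t$, as a sum of
normally distributed random variables, is also normally distributed. Because the $Z_j$'s
are independent of each other, $X_t$ has mean zero and variance $\sigma^2(1 - \phi^{2t})/(1 - \phi^2)$.

The expected value of $\frac{1}{\sqrt{T}}\sum_{t=1}^{T} X_{t-1}Z_t$ is zero because $Z_t \sim \mathrm{IIN}(0, \sigma^2)$. The
variance of this expression is given by

$$V\left(\frac{1}{\sqrt{T}}\sum_{t=1}^{T} X_{t-1}Z_t\right) = \frac{1}{T}\sum_{t=1}^{T} E X_{t-1}^2\, E Z_t^2 = \frac{\sigma^2}{T}\sum_{t=1}^{T} E X_{t-1}^2.$$

Moreover, $\sum_{t=1}^{T} X_t^2 = \sum_{t=1}^{T} X_{t-1}^2 - (X_0^2 - X_T^2) = \phi^2\sum_{t=1}^{T} X_{t-1}^2 + \sum_{t=1}^{T} Z_t^2 +
2\phi\sum_{t=1}^{T} X_{t-1}Z_t$ so that

$$\sum_{t=1}^{T} X_{t-1}^2 = \frac{1}{1-\phi^2}\left(X_0^2 - X_T^2\right) + \frac{1}{1-\phi^2}\sum_{t=1}^{T} Z_t^2 + \frac{2\phi}{1-\phi^2}\sum_{t=1}^{T} X_{t-1}Z_t.$$

The expected value multiplied by $\sigma^2/T$ thus is equal to

$$\begin{aligned}
\frac{\sigma^2}{T}\sum_{t=1}^{T} E X_{t-1}^2 &= \frac{\sigma^2}{1-\phi^2}\,\frac{E\left(X_0^2 - X_T^2\right)}{T} + \frac{\sigma^2}{1-\phi^2}\,\frac{\sum_{t=1}^{T} E Z_t^2}{T} + \frac{2\phi\,\sigma^2}{1-\phi^2}\,\frac{\sum_{t=1}^{T} E\left(X_{t-1}Z_t\right)}{T} \\
&= -\frac{\sigma^4(1 - \phi^{2T})}{T(1-\phi^2)^2} + \frac{\sigma^4}{1-\phi^2}.
\end{aligned}$$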
For T going to infinity, we finally get:

$$\lim_{T\to\infty} V\left(\frac{1}{\sqrt{T}}\sum_{t=1}^{T} X_{t-1}Z_t\right) = \frac{\sigma^4}{1-\phi^2}.$$

The numerator in Eq. (5.4) therefore converges to a normal random variable with
mean zero and variance $\frac{\sigma^4}{1-\phi^2}$.

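The denominator in Eq. (5.4) can be rewritten as

$$\frac{1}{T}\sum_{t=1}^{T} X_{t-1}^2 = \frac{X_0^2 - X_T^2}{(1-\phi^2)T} + \frac{1}{(1-\phi^2)T}\sum_{t=1}^{T} Z_t^2 + \frac{2\phi}{(1-\phi^2)T}\sum_{t=1}^{T} X_{t-1}Z_t.$$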


The expected value and the variance of $X_T^2/T$ converge to zero. Chebyschev's
inequality (see Theorem C.3 in Appendix C) then implies that the first term
converges also in probability to zero. $X_0$ is equal to zero by assumption. The second
term has a constant mean equal to $\sigma^2/(1 - \phi^2)$ and a variance which converges to
zero. Theorem C.8 in Appendix C then implies that the second term converges in
probability to $\sigma^2/(1 - \phi^2)$. The third term has a mean of zero and a variance which
converges to zero. Thus the third term converges to zero in probability. This implies:

$$\frac{1}{T}\sum_{t=1}^{T} X_{t-1}^2 \xrightarrow{p} \frac{\sigma^2}{1-\phi^2}.$$


Putting the results for the numerator and the denominator together and applying
Theorem C.10 and the continuous mapping theorem for the convergence in distribution
one finally obtains:

$$\sqrt{T}\left(\hat{\phi}_T - \phi\right) \xrightarrow{d} N(0, 1 - \phi^2). \tag{5.5}$$

Thereby the value for the variance is derived from

$$\left(\frac{1}{\frac{\sigma^2}{1-\phi^2}}\right)^2 \frac{\sigma^4}{1-\phi^2} = 1 - \phi^2.$$


5.3  Estimation  of  an  ARMA(p,q)  Model 

While the estimation of AR models by OLS is rather straightforward and leads to
consistent and asymptotically efficient estimates, the estimation of ARMA models
is more complex. The reason is that, in contrast to past $X_t$, $Z_t, Z_{t-1}, \ldots, Z_{t-q}$ are
not directly observable from the data. They must be inferred from the observations
of $X_t$. The standard method for the estimation of ARMA models is the method of
maximum likelihood which will be explained in this section.
We assume that the process $\{X_t\}$ is a causal and invertible ARMA(p,q) process
following the difference equation

$$X_t - \phi_1 X_{t-1} - \ldots - \phi_p X_{t-p} = Z_t + \theta_1 Z_{t-1} + \ldots + \theta_q Z_{t-q}$$

with $Z_t \sim \mathrm{IID}(0, \sigma^2)$. We also assume that $\Phi(z)$ and $\Theta(z)$ have no roots in common.
We then stack the parameters of the model into a vector $\beta$ and a scalar $\sigma^2$:

$$\beta = (\phi_1, \ldots, \phi_p, \theta_1, \ldots, \theta_q)' \quad \text{and} \quad \sigma^2.$$

Given the assumption above the admissible parameter space for $\beta$, $C$, is described
by the following set:

$$C = \{\beta \in \mathbb{R}^{p+q} : \Phi(z)\Theta(z) \ne 0 \text{ for } |z| \le 1,\ \phi_p\theta_q \ne 0,\ \Phi(z) \text{ and } \Theta(z) \text{ have no roots in common}\}$$

The estimation by the method of maximum likelihood (ML method) is based
on some assumption about the joint distribution of $X_T = (X_1, \ldots, X_T)'$ given
the parameters $\beta$ and $\sigma^2$. This joint distribution function is called the likelihood
function. The method of maximum likelihood then determines the parameters
such that the probability of observing a given sample $x_T = (x_1, \ldots, x_T)$ is
maximized. This is achieved by maximizing the likelihood function with respect
to the parameters.

By far the most important case is given by assuming that $\{X_t\}$ is a Gaussian
process with mean zero and autocovariance function $\gamma$. This implies that $X_T =
(X_1, \ldots, X_T)'$ is distributed as a multivariate normal with mean zero and variance
$\Gamma_T$.2 The Gaussian likelihood function given the observations $x_T$, $L_T(\beta, \sigma^2 | x_T)$, is
then given by

$$\begin{aligned}
L_T(\beta, \sigma^2 | x_T) &= (2\pi)^{-T/2}(\det \Gamma_T)^{-1/2} \exp\left(-\frac{1}{2}x_T'\Gamma_T^{-1}x_T\right) \\
&= (2\pi\sigma^2)^{-T/2}(\det G_T)^{-1/2} \exp\left(-\frac{1}{2\sigma^2}x_T'G_T^{-1}x_T\right)
\end{aligned}$$

where $G_T = \sigma^{-2}\Gamma_T$. Note that, in contrast to $\Gamma_T$, $G_T$ depends only on $\beta$ and not
on $\sigma^2$.3 If one wants to point out the dependence of $G_T$ on $\beta$ we write $G_T(\beta)$. The
method of maximum likelihood then consists in the maximization of the likelihood
function with respect to $\beta$ and $\sigma^2$ taking the data $x_T$ as given.


2 If the process does not have a mean of zero, we can demean the data in a preliminary step.

3 In Sect. 2.4 we showed how the autocovariance function $\gamma$ and, as a consequence, $\Gamma_T$ and $G_T$ can be inferred from a given ARMA model, i.e. from a given $\beta$.
The first order condition of this maximization problem with respect to $\sigma^2$ is obtained by taking the logarithm of the likelihood function $L_T(\beta, \sigma^2 | x_T)$, differentiating with respect to $\sigma^2$, and setting the resulting expression equal to zero:

$$\frac{\partial \ln L_T(\beta, \sigma^2 | x_T)}{\partial \sigma^2} = -\frac{T}{2}\frac{1}{\sigma^2} + \frac{x_T' G_T^{-1} x_T}{2\sigma^4} = 0.$$

Solving this equation with respect to $\sigma^2$ we get as the solution: $\sigma^2 = T^{-1} x_T' G_T^{-1} x_T$. Inserting this value into the original likelihood function and taking the logarithm, one gets the concentrated log-likelihood function:

$$\ln L_T(\beta | x_T) = -\frac{T}{2}\ln(2\pi) - \frac{T}{2}\ln\left(T^{-1} x_T' G_T(\beta)^{-1} x_T\right) - \frac{1}{2}\ln\det G_T(\beta) - \frac{T}{2}.$$

This function is then maximized with respect to $\beta \in C$. This is, however, equivalent to minimizing the function

$$\ell_T(\beta | x_T) = \ln\left(T^{-1} x_T' G_T(\beta)^{-1} x_T\right) + T^{-1}\ln\det G_T(\beta) \longrightarrow \min_{\beta \in C}.$$

The value of $\beta$ which minimizes the above function is called the maximum-likelihood estimator of $\beta$. It will be denoted by $\hat\beta_{ML}$. The maximum-likelihood estimator for $\sigma^2$, $\hat\sigma^2_{ML}$, is then given by

$$\hat\sigma^2_{ML} = T^{-1} x_T' G_T(\hat\beta_{ML})^{-1} x_T.$$

The actual computation of $\det G_T(\beta)$ and $G_T(\beta)^{-1}$ is numerically involved, especially when $T$ is large, and should therefore be avoided. It is therefore convenient to rewrite the likelihood function in a different, but equivalent form:

$$L_T(\beta, \sigma^2 | x_T) = (2\pi\sigma^2)^{-T/2} (r_0 r_1 \cdots r_{T-1})^{-1/2} \exp\left(-\frac{1}{2\sigma^2}\sum_{t=1}^{T} \frac{(X_t - P_{t-1}X_t)^2}{r_{t-1}}\right).$$


Thereby $P_{t-1}X_t$ denotes the least-squares predictor of $X_t$ given $X_{t-1}, \ldots, X_1$ and $r_t = v_t/\sigma^2$ where $v_t$ is the mean squared forecast error as defined in Sect. 3.1. Several numerical algorithms have been developed to compute these forecasts in a numerically efficient and stable way.4

$P_{t-1}X_t$ and $r_t$ do not depend on $\sigma^2$ so that the partial differentiation of the log-likelihood function $\ln L_T(\beta, \sigma^2 | x_T)$ with respect to the parameters leads to the maximum likelihood estimator. This estimator fulfills the following equations:

$$\hat\sigma^2_{ML} = \frac{1}{T} S(\hat\beta_{ML})$$


4One  such  algorithm  is  the  innovation  algorithm.  See  Brockwell  and  Davis  (1991,  section  5)  for 
details. 
where

$$S(\hat\beta_{ML}) = \sum_{t=1}^{T} \frac{(X_t - P_{t-1}X_t)^2}{r_{t-1}}$$

and where $\hat\beta_{ML}$ denotes the value of $\beta$ which minimizes the function

$$\ell_T(\beta | x_T) = \ln\left(\frac{1}{T} S(\beta)\right) + \frac{1}{T}\sum_{t=1}^{T} \ln r_{t-1}$$


subject to $\beta \in C$. This optimization problem must be solved numerically. In practice, one chooses as a starting value $\beta_0$ for the iteration an initial estimate such that $\beta_0 \in C$. In the following iterations this restriction is no longer imposed to enhance speed and reduce the complexity of the optimization problem. This implies that one must check whether the final estimates so obtained are indeed in $C$.
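The criterion above can be sketched for the simplest case, an MA(1) model $X_t = Z_t + \theta Z_{t-1}$. The recursion used here is the innovations algorithm specialized to the MA(1) case; the function names, the grid search, and the sample values are illustrative assumptions and not part of the text. A real application would use a proper numerical optimizer instead of a grid.

```python
import math

def ma1_innovations(x, theta):
    """One-step predictors P_{t-1}X_t and scaled error variances r_{t-1} for an MA(1)."""
    r = [1.0 + theta ** 2]   # r_0 = gamma(0) / sigma^2
    pred = [0.0]             # P_0 X_1 = 0 for a mean-zero process
    for t in range(1, len(x)):
        # Predictor uses the previous innovation, weighted by theta / r_{t-1}
        pred.append(theta / r[t - 1] * (x[t - 1] - pred[t - 1]))
        r.append(1.0 + theta ** 2 - theta ** 2 / r[t - 1])
    return pred, r

def ml_criterion(x, theta):
    """ln(S(theta)/T) + (1/T) * sum of ln r_{t-1}, the function to be minimized."""
    pred, r = ma1_innovations(x, theta)
    T = len(x)
    S = sum((xt - pt) ** 2 / rt for xt, pt, rt in zip(x, pred, r))
    return math.log(S / T) + sum(math.log(rt) for rt in r) / T

def grid_search(x, step=0.01):
    """Crude minimization over the invertible region |theta| < 1 (illustration only)."""
    grid = [k * step for k in range(-99, 100)]
    return min(grid, key=lambda th: ml_criterion(x, th))

# Hypothetical demeaned sample, used only to show the calls
sample = [0.3, -1.2, 0.8, 0.1, -0.5, 1.4, -0.7, 0.2]
theta_hat = grid_search(sample)
```

Because the grid only contains values with $|\theta| < 1$, the restriction to the admissible region is imposed by construction here; an unconstrained optimizer would require the final check described above.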

If instead of $\ell_T(\beta | x_T)$, the function

$$S(\beta) = \sum_{t=1}^{T} \frac{(X_t - P_{t-1}X_t)^2}{r_{t-1}}$$

is minimized subject to the constraint $\beta \in C$, we obtain the least-squares estimator of $\beta$, denoted by $\hat\beta_{LS}$. The least-squares estimator of $\sigma^2$, $\hat\sigma^2_{LS}$, is then

$$\hat\sigma^2_{LS} = \frac{S(\hat\beta_{LS})}{T - p - q}.$$

The term $\frac{1}{T}\sum_{t=1}^{T} \ln r_{t-1}$ disappears asymptotically because, given the restriction $\beta \in C$, the mean-squared forecast error $v_T$ converges to $\sigma^2$ and thus $r_T$ goes to one as $T$ goes to infinity. This implies that for $T$ going to infinity the maximization of the likelihood function becomes equivalent to the minimization of the least-squares criterion. Thus the maximum-likelihood estimator and the least-squares estimator share the same asymptotic normal distribution.

Note also that in the case of autoregressive models $r_t$ is constant and equal to one. In this case, the least-squares criterion $S(\beta)$ reduces to the criterion (5.3) discussed in the previous Sect. 5.2.




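Theorem 5.3 (Asymptotic Distribution of ML Estimator). If $\{X_t\}$ is an ARMA process with true parameters $\beta \in C$ and $Z_t \sim \mathrm{IID}(0, \sigma^2)$ with $\sigma^2 > 0$ then the maximum-likelihood estimator and the least-squares estimator have asymptotically the same normal distribution:

$$\sqrt{T}\,(\hat\beta_{ML} - \beta) \xrightarrow{\ d\ } N(0, V(\beta)),$$
$$\sqrt{T}\,(\hat\beta_{LS} - \beta) \xrightarrow{\ d\ } N(0, V(\beta)).$$

The asymptotic covariance matrix $V(\beta)$ is thereby given by

$$V(\beta) = \begin{pmatrix} \mathbb{E}\,U_t U_t' & \mathbb{E}\,U_t V_t' \\ \mathbb{E}\,V_t U_t' & \mathbb{E}\,V_t V_t' \end{pmatrix}^{-1}$$

$$U_t = (u_t, u_{t-1}, \ldots, u_{t-p+1})'$$
$$V_t = (v_t, v_{t-1}, \ldots, v_{t-q+1})'$$

where $\{u_t\}$ and $\{v_t\}$ denote autoregressive processes defined as $\Phi(L)u_t = w_t$ and $\Theta(L)v_t = w_t$ with $w_t \sim \mathrm{WN}(0, 1)$.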

Proof.  See  Brockwell  and  Davis  (1991,  Section  8.8).  □ 

It can be shown that both estimators are asymptotically efficient.5 Note that the asymptotic covariance matrix $V(\beta)$ is independent of $\sigma^2$.

The use of the Gaussian likelihood function makes sense even when the process is not Gaussian. First, the Gaussian likelihood can still be interpreted as a measure of fit of the ARMA model to the data. Second, the asymptotic distribution is still Gaussian even when the process is not Gaussian as long as $Z_t \sim \mathrm{IID}(0, \sigma^2)$. The Gaussian likelihood is then called the quasi Gaussian likelihood. The estimator based on the Gaussian likelihood is, however, in general no longer efficient under this circumstance.


Example:  AR(p)  Process 

In this case $\beta = (\phi_1, \ldots, \phi_p)'$ and $V(\phi) = (\mathbb{E}\,U_t U_t')^{-1} = \sigma^2 \Gamma_p^{-1}$. This is, however, the same asymptotic distribution as that of the Yule-Walker estimator. The Yule-Walker, the least-squares, and the maximum likelihood estimator are therefore asymptotically equivalent in the case of an AR(p) process. The main difference lies in the treatment of the first p observations.


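5 See Brockwell and Davis (1991) and Fan and Yao (2003) for details.

In particular, we have:

$$\text{AR(1)}: \quad \hat\phi \sim N\left(\phi, \frac{1 - \phi^2}{T}\right),$$

$$\text{AR(2)}: \quad \begin{pmatrix} \hat\phi_1 \\ \hat\phi_2 \end{pmatrix} \sim N\left( \begin{pmatrix} \phi_1 \\ \phi_2 \end{pmatrix}, \frac{1}{T} \begin{pmatrix} 1 - \phi_2^2 & -\phi_1(1 + \phi_2) \\ -\phi_1(1 + \phi_2) & 1 - \phi_2^2 \end{pmatrix} \right).$$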


Example:  MA(q)  Process 


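Similarly, one can compute the asymptotic distribution for an MA(q) process. In particular, we have:

$$\text{MA(1)}: \quad \hat\theta \sim N\left(\theta, \frac{1 - \theta^2}{T}\right),$$

$$\text{MA(2)}: \quad \begin{pmatrix} \hat\theta_1 \\ \hat\theta_2 \end{pmatrix} \sim N\left( \begin{pmatrix} \theta_1 \\ \theta_2 \end{pmatrix}, \frac{1}{T} \begin{pmatrix} 1 - \theta_2^2 & \theta_1(1 - \theta_2) \\ \theta_1(1 - \theta_2) & 1 - \theta_2^2 \end{pmatrix} \right).$$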


Example:  ARMA(1,1)  Process 


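For an ARMA(1,1) process the asymptotic covariance matrix is given by

$$V(\phi, \theta) = \begin{pmatrix} (1 - \phi^2)^{-1} & (1 + \phi\theta)^{-1} \\ (1 + \phi\theta)^{-1} & (1 - \theta^2)^{-1} \end{pmatrix}^{-1}.$$

Therefore we have:

$$\begin{pmatrix} \hat\phi \\ \hat\theta \end{pmatrix} \sim N\left( \begin{pmatrix} \phi \\ \theta \end{pmatrix}, \frac{1}{T}\,\frac{1 + \phi\theta}{(\phi + \theta)^2} \begin{pmatrix} (1 - \phi^2)(1 + \phi\theta) & -(1 - \theta^2)(1 - \phi^2) \\ -(1 - \theta^2)(1 - \phi^2) & (1 - \theta^2)(1 + \phi\theta) \end{pmatrix} \right).$$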
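The step from the inverse matrix to the closed form can be checked numerically. The following sketch inverts the 2x2 matrix directly and compares it with the closed-form expression; the function names and the parameter values are illustrative assumptions.

```python
def arma11_cov_by_inversion(phi, theta):
    """Invert [[1/(1-phi^2), 1/(1+phi*theta)], [1/(1+phi*theta), 1/(1-theta^2)]]."""
    a = 1.0 / (1.0 - phi ** 2)
    b = 1.0 / (1.0 + phi * theta)
    c = 1.0 / (1.0 - theta ** 2)
    det = a * c - b * b
    return [[c / det, -b / det], [-b / det, a / det]]

def arma11_cov_closed_form(phi, theta):
    """Closed-form asymptotic covariance matrix V(phi, theta)."""
    k = (1.0 + phi * theta) / (phi + theta) ** 2
    off = -k * (1.0 - theta ** 2) * (1.0 - phi ** 2)
    return [[k * (1.0 - phi ** 2) * (1.0 + phi * theta), off],
            [off, k * (1.0 - theta ** 2) * (1.0 + phi * theta)]]

def standard_errors(phi, theta, T):
    """Approximate standard errors of (phi_hat, theta_hat) for sample size T."""
    V = arma11_cov_closed_form(phi, theta)
    return (V[0][0] / T) ** 0.5, (V[1][1] / T) ** 0.5

# Hypothetical parameter values and sample size
se_phi, se_theta = standard_errors(0.5, 0.3, 100)
```

The factor $(\phi + \theta)^{-2}$ shows that the variances grow without bound as $\phi$ approaches $-\theta$, i.e. as the AR and MA roots come close to cancelling.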


5.4 Estimation of the Orders p and q

Up to now we have always assumed that the true orders p and q of the ARMA model are known. This is, however, seldom the case in practice. As economic theory does not usually provide an indication, it is all too often the case that the orders of the ARMA model must be identified from the data. In such a situation one can make two types of errors: p and q are too large, in which case we speak of overfitting; p and q are too low, in which case we speak of underfitting.

In the case of overfitting, the maximum likelihood estimator is no longer consistent for the true parameter, but still consistent for the coefficients of the causal representation $\psi_j$, $j = 0, 1, 2, \ldots$, where $\psi(z) = \Theta(z)/\Phi(z)$. This can be illustrated by the following example. Suppose that $\{X_t\}$ is a white noise process, i.e. $X_t = Z_t \sim \mathrm{WN}(0, \sigma^2)$, but we fit an ARMA(1,1) model given by $X_t - \phi X_{t-1} = Z_t + \theta Z_{t-1}$. Then, the maximum likelihood estimator does not converge to $\phi = \theta = 0$, but only to the line-segment $\phi = -\theta$ with $|\phi| < 1$ and $|\theta| < 1$. For values of $\phi$ and $\theta$ on this line-segment we have $\psi(z) = \Theta(z)/\Phi(z) = 1$. The maximum likelihood estimator converges to the true values of $\psi_j$, i.e. to the values $\psi_0 = 1$ and $\psi_j = 0$ for $j > 0$. The situation is depicted in Fig. 5.1. There it is shown that the estimator has a tendency to converge to the points $(-1, 1)$ and $(1, -1)$, depending on the starting values. This indeterminacy of the estimator manifests itself as a numerical problem in the optimization of the likelihood function. Thus models with similar roots for the AR and MA polynomials which are close in absolute value to the unit circle are probably overparametrized. The problem can be overcome by reducing the orders of the AR and MA polynomial by one.

Fig. 5.1 Parameter space of a causal and invertible ARMA(1,1) process
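The cancellation in the white-noise example can be made concrete by computing the coefficients of the causal representation of an ARMA(1,1) model. For $X_t - \phi X_{t-1} = Z_t + \theta Z_{t-1}$ the expansion of $(1 + \theta z)/(1 - \phi z)$ gives $\psi_0 = 1$ and $\psi_j = \phi^{j-1}(\phi + \theta)$ for $j \ge 1$. The function name and the parameter values below are illustrative assumptions.

```python
def arma11_psi(phi, theta, n):
    """First n coefficients psi_0, ..., psi_{n-1} of (1 + theta z) / (1 - phi z)."""
    psi = [1.0]
    for j in range(1, n):
        psi.append(phi ** (j - 1) * (phi + theta))
    return psi

# Two different points on the line segment phi = -theta
on_line_a = arma11_psi(0.7, -0.7, 5)
on_line_b = arma11_psi(-0.2, 0.2, 5)
# A point off the line segment, for comparison
off_line = arma11_psi(0.7, 0.2, 5)
```

Both points on the line segment produce the white-noise representation $\psi_0 = 1$, $\psi_j = 0$ for $j > 0$, which is why the data cannot discriminate between them.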

This problem does not appear in purely autoregressive models. As explained in section “Example: AR(1) Process”, the estimator for the redundant coefficients converges to zero with asymptotic distribution $N(0, 1/T)$ (see the result in Eq. (5.1)). This is one reason why purely autoregressive models are often preferred. In addition, the estimator is easily implemented and every stationary stochastic process can be approximated arbitrarily well by an AR process. This approximation may, however, necessitate high order models when the true process encompasses an MA component.

In  the  case  of  underfitting  the  maximum  likelihood  estimator  converges  to  those 
values  which  are  closest  to  the  true  parameters  given  the  restricted  parameter  space. 
The  estimates  are,  however,  inconsistent  due  to  the  “omitted  variable  bias”. 
For these reasons the identification of the orders is an important step. One method which goes back to Box and Jenkins (1976) consists in the analysis of the autocorrelation function (ACF) and the partial autocorrelation function (PACF) (see Sect. 3.5). Although this method requires some experience, especially when the process is not a purely AR or MA process, the analysis of the ACF and PACF remains an important first step in every practical investigation of a time series.

An alternative procedure relies on automatic order selection. The objective is to minimize a so-called information criterion over different values of p and q. These criteria are based on the following consideration. Given a fixed number of observations, the successive increase of the orders p and q increases the fit of the model so that the variance of the residuals $\hat\sigma^2_{p,q}$ steadily decreases. In order to compensate for this tendency to overfitting a penalty is introduced. This penalty term depends on the number of free parameters and on the number of observations at hand.6 The most important information criteria have the following additive form:

$$\ln \hat\sigma^2_{p,q} + (\#\ \text{free parameters})\,\frac{C(T)}{T} = \ln \hat\sigma^2_{p,q} + (p + q)\,\frac{C(T)}{T} \longrightarrow \min_{p,q},$$

where $\ln \hat\sigma^2_{p,q}$ measures the goodness of fit of the ARMA(p,q) model and $(p + q)\frac{C(T)}{T}$ denotes the penalty term. Thereby $C(T)$ represents a nondecreasing function of $T$ which governs the trade-off between goodness of fit and complexity (dimension) of the model. Thus, the information criteria choose higher order models for larger sample sizes $T$. If the model includes a constant term or other exogenous variables, the criterion must be adjusted accordingly. However, this will introduce, for a given sample size, just a constant term in the objective function and will therefore not influence the choice of p and q.

The most common criteria are the Akaike information criterion (AIC), the Schwarz or Bayesian information criterion (BIC), and the Hannan-Quinn information criterion (HQ criterion):

$$\mathrm{AIC}(p, q) = \ln \hat\sigma^2_{p,q} + (p + q)\,\frac{2}{T}$$

$$\mathrm{BIC}(p, q) = \ln \hat\sigma^2_{p,q} + (p + q)\,\frac{\ln T}{T}$$

$$\mathrm{HQC}(p, q) = \ln \hat\sigma^2_{p,q} + (p + q)\,\frac{2\ln(\ln T)}{T}$$


6 See Brockwell and Davis (1991) for details and a deeper appreciation.

Because AIC < HQC < BIC for a given sample size $T \ge 16$, Akaike’s criterion delivers the largest models, i.e. the highest order p + q; the Bayesian criterion is more restrictive and therefore delivers the smallest models, i.e. the lowest p + q. Although Akaike’s criterion is not consistent with respect to p and q and has a tendency to deliver overfitted models, it is still widely used in practice. This feature is sometimes desired as overfitting is seen as less damaging than underfitting. Only the BIC and HQC lead to consistent estimates of the orders p and q.
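The three criteria differ only in the penalty function $C(T)$, which equals 2 for the AIC, $\ln T$ for the BIC, and $2\ln(\ln T)$ for the HQC. A minimal sketch, with a hypothetical function name and hypothetical input values, makes the ordering of the penalties visible:

```python
import math

def info_criteria(sigma2_hat, p, q, T):
    """AIC, BIC and HQC for an ARMA(p,q) model with residual variance sigma2_hat."""
    fit = math.log(sigma2_hat)   # goodness-of-fit term
    k = p + q                    # number of free parameters
    return {
        "AIC": fit + k * 2.0 / T,
        "BIC": fit + k * math.log(T) / T,
        "HQC": fit + k * 2.0 * math.log(math.log(T)) / T,
    }

# Hypothetical residual variance and sample size
values = info_criteria(0.7, 1, 3, 100)
```

The ordering AIC < HQC < BIC requires $2 < 2\ln(\ln T) < \ln T$. The first inequality holds once $T > e^e \approx 15.2$, which is where the threshold of 16 observations comes from.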


5.5  Modeling  a  Stochastic  Process 

The  identification  of  a  satisfactory  ARMA  model  typically  involves  in  practice 
several  steps. 


Step 1: Transformations to Achieve Stationary Time Series

Economic time series are often of a non-stationary nature. It is therefore necessary to transform the time series in a first step to achieve stationarity. Time series which exhibit a pronounced trend (GDP, stock market indices, etc.) should not be modeled in levels, but in differences. If the variable under consideration is already in logs, as is often the case, then taking first differences effectively amounts to working with growth rates. Sometimes first differences are not enough and further differences have to be taken. Price indices or monetary aggregates are typical examples where first differences may not be sufficient to achieve stationarity. Thus instead of $X_t$ one works with the series $Y_t = (1 - L)^d X_t$ with $d = 1, 2, \ldots$ A non-stationary process $\{X_t\}$ which needs to be differenced d times to arrive at a stationary time series is called integrated of order d, $X_t \sim I(d)$.8 If $Y_t = (1 - L)^d X_t$ is generated by an ARMA(p,q) process, $\{X_t\}$ is said to be an ARIMA(p,d,q) process.
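The operator $(1 - L)^d$ simply applies first differences d times. A minimal sketch, with a hypothetical function name and made-up data, shows both repeated differencing and the growth-rate interpretation of differenced logs:

```python
import math

def difference(x, d=1):
    """Apply the first-difference operator (1 - L) to the sequence x, d times."""
    for _ in range(d):
        x = [b - a for a, b in zip(x, x[1:])]
    return list(x)

# A quadratic trend needs two differences to become constant
quadratic = [1, 4, 9, 16, 25]
twice_differenced = difference(quadratic, d=2)

# Hypothetical level series growing by 2 % per period
levels = [100.0, 102.0, 104.04]
log_growth = difference([math.log(v) for v in levels])
```

Each round of differencing shortens the series by one observation, so a series of length T yields T - d transformed observations.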

An alternative method to eliminate the trend is to regress the time series against a polynomial in t of degree s, i.e. against $(1, t, \ldots, t^s)$, and to proceed with the residuals. These residuals can then be modeled as an ARMA(p,q) process. Chapter 7 discusses in detail which of the two detrending methods is to be preferred under which circumstances.

Often the data are subject to seasonal fluctuations. As with the trend there are several alternatives available. The first possibility is to pass the time series through a seasonal filter and work with the seasonally adjusted data. The construction of seasonal filters is discussed in Chap. 6. A second alternative is to include seasonal dummies in the ARMA model. A third alternative is to take seasonal differences. In the case of quarterly observations, this amounts to working with $Y_t = (1 - L^4)X_t$. As $1 - L^4 = (1 - L)(1 + L + L^2 + L^3)$, this transformation involves a first difference and will therefore also account for the trend.
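The factorization of the seasonal difference operator and its effect on a trending seasonal series can be verified with a few lines. The function names and the artificial quarterly series below are illustrative assumptions.

```python
def poly_mul(a, b):
    """Multiply two lag polynomials given as coefficient lists [c_0, c_1, ...]."""
    out = [0] * (len(a) + len(b) - 1)
    for i, ai in enumerate(a):
        for j, bj in enumerate(b):
            out[i + j] += ai * bj
    return out

def seasonal_difference(x, s=4):
    """Apply (1 - L^s) to the sequence x."""
    return [x[t] - x[t - s] for t in range(s, len(x))]

# (1 - L)(1 + L + L^2 + L^3) should give 1 - L^4
product = poly_mul([1, -1], [1, 1, 1, 1])

# Artificial quarterly series: linear trend plus a fixed seasonal pattern
season = [5, -3, 1, -3]
series = [2 * t + season[t % 4] for t in range(12)]
adjusted = seasonal_difference(series)
```

In this artificial example the seasonal difference removes both the seasonal pattern and the linear trend, leaving the constant annual increment.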


7  See,  for  example,  the  section  on  the  unit  root  tests  7.3. 

8  An  exact  definition  will  be  provided  in  Chap.  7.  In  this  chapter  we  will  analyze  the  consequences 
of  non-stationarity  and  discuss  tests  for  specific  forms  of  non-stationarity. 
Step 2: Finding the Orders p and q

Having  achieved  stationarity,  one  has  to  find  the  appropriate  orders  p  and  q  of  the 
ARMA  model.  Thereby  one  can  rely  either  on  the  analysis  of  the  ACF  and  the  PACF, 
or  on  the  information  criteria  outlined  in  the  previous  Sect.  5.4. 


Step  3:  Checking  the  Plausibility 

After  having  identified  a  particular  model  or  a  set  of  models,  one  has  to  inspect  its 
adequacy.  There  are  several  dimensions  along  which  the  model(s)  can  be  checked. 

(i) Are the residuals white noise? This can be checked by inspecting the ACF of the residuals and by applying the Ljung-Box test (see Eq. (4.4)). If they are not, this means that the model failed to capture all the dynamics inherent in the data.

(ii) Are the parameters plausible?

(iii) Are the parameters constant over time? Are there structural breaks? This can be checked by looking at the residuals or by comparing parameter estimates across subsamples. More systematic approaches are discussed in Perron (2006). These involve the revolving estimation of parameters by allowing the break point to vary over the sample. Thereby different types of structural breaks can be distinguished. A more in-depth analysis of structural breaks is presented in Sect. 18.1.

(iv) Does the model deliver sensible forecasts? It is particularly useful to investigate the out-of-sample forecasting performance. If one has several candidate models, one can perform a horse-race among them.

In  case  the  model  turns  out  to  be  unsatisfactory,  one  has  to  go  back  to  steps  1 
and  2. 


5.6 An Example: Modeling Real GDP in the Case of Switzerland

This section illustrates the concepts and ideas just presented by working out a specific example. We take the seasonally unadjusted Swiss real GDP as an example. The data are plotted in Fig. 1.3. To take the seasonality into account we transform the logged time series by taking first seasonal differences, i.e. $X_t = (1 - L^4)\ln \mathrm{GDP}_t$. Thus, the variable corresponds to the growth rate with respect to the same quarter of the previous year. The data are plotted in Fig. 5.2. A cursory inspection of the plot reveals that this transformation eliminated the trend as well as the seasonality.

First we analyze the ACF and the PACF. They are plotted together with the corresponding confidence intervals in Fig. 5.3. The slowly and monotonically declining ACF suggests an AR process. As only the first two orders of the PACF are significantly different from zero, it seems that an AR(2) model is appropriate. The least-squares estimates of this model are:
Fig. 5.2 Real GDP growth rates of Switzerland (horizontal axis: time, 1985 to 2010)


$$X_t - \underset{(0.103)}{1.134}\,X_{t-1} + \underset{(0.104)}{0.310}\,X_{t-2} = 0.218 + Z_t \quad \text{with } \hat\sigma^2 = 0.728$$

The numbers in parentheses are the estimated standard errors of the corresponding parameters above. The roots of the AR-polynomial are 1.484 and 2.174. They are clearly outside the unit circle so that there exists a stationary and causal representation.
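The reported roots can be reproduced from the estimated coefficients. With the sign convention $\Phi(z) = 1 - \phi_1 z - \phi_2 z^2$, the estimates above correspond to $\phi_1 = 1.134$ and $\phi_2 = -0.310$. The function name in this sketch is an illustrative assumption.

```python
import cmath

def ar2_roots(phi1, phi2):
    """Roots of the AR polynomial 1 - phi1*z - phi2*z^2, sorted by modulus."""
    # Equivalent quadratic: phi2*z^2 + phi1*z - 1 = 0
    disc = cmath.sqrt(phi1 ** 2 + 4.0 * phi2)
    roots = [(-phi1 + disc) / (2.0 * phi2), (-phi1 - disc) / (2.0 * phi2)]
    return sorted(roots, key=abs)

roots = ar2_roots(1.134, -0.310)
is_causal = all(abs(z) > 1.0 for z in roots)
```

Using `cmath` keeps the sketch valid for complex roots as well, in which case causality is still judged by the modulus. Small deviations from the reported values are due to the rounding of the published coefficients.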

Next, we investigate the information criteria AIC and BIC to identify the orders of the ARMA(p,q) model. We examine all models with $0 \le p, q \le 4$. The AIC and the BIC values are reported in Tables 5.1 and 5.2. Both criteria reach a minimum at $(p, q) = (1, 3)$ (marked in the tables) so that both criteria prefer an ARMA(1,3) model.

The parameters of this model are as follows:

$$X_t - \underset{(0.134)}{0.527}\,X_{t-1} = 0.6354 + Z_t + \underset{(0.1395)}{0.5106}\,Z_{t-1} + \underset{(0.1233)}{0.5611}\,Z_{t-2} + \underset{(0.1238)}{0.4635}\,Z_{t-3} \quad \text{with } \hat\sigma^2 = 0.648.$$

The estimated standard errors of the estimated parameters are again reported in parentheses below. The AR(2) model is not considerably worse than the ARMA(1,3) model; according to the BIC criterion it is even the second best model.
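The ranking of the candidate models can be reproduced from the values reported in Tables 5.1 and 5.2. The sketch below transcribes the two tables (rows p = 0 to 4, columns q = 0 to 4) and sorts the models by criterion value; the variable and function names are illustrative assumptions.

```python
# Values transcribed from Tables 5.1 and 5.2; None marks the ARMA(0,0) cell,
# for which no value is reported.
AIC = [
    [None, 0.3021, 0.0188, -0.2788, -0.3067],
    [-0.2174, -0.2425, -0.2433, -0.3446, -0.2991],
    [-0.2721, -0.2639, -0.2613, -0.3144, -0.2832],
    [-0.2616, -0.2276, -0.2780, -0.2663, -0.2469],
    [-0.2186, -0.1990, -0.2291, -0.2574, -0.2099],
]
BIC = [
    [None, 0.3297, 0.0740, -0.1961, -0.1963],
    [-0.1896, -0.1869, -0.1600, -0.2335, -0.1603],
    [-0.2162, -0.1801, -0.1495, -0.1746, -0.1154],
    [-0.1772, -0.1150, -0.1373, -0.0974, -0.0499],
    [-0.1052, -0.0573, -0.0591, -0.0590, 0.0169],
]

def ranked_orders(table):
    """Return the orders (p, q) sorted from the smallest to the largest criterion value."""
    cells = [(value, (p, q))
             for p, row in enumerate(table)
             for q, value in enumerate(row)
             if value is not None]
    return [order for _, order in sorted(cells)]

best_aic = ranked_orders(AIC)[0]
best_two_bic = ranked_orders(BIC)[:2]
```

Both rankings start with the order (1, 3), and under the BIC the order (2, 0), i.e. the AR(2) model, comes second.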
Fig. 5.3 Autocorrelation function (ACF) and partial autocorrelation function (PACF) of real GDP growth rates of Switzerland with 95 % confidence interval


Table 5.1 Values of Akaike’s information criterion (AIC) for alternative ARMA(p,q) models

          q = 0     q = 1     q = 2     q = 3     q = 4
p = 0               0.3021    0.0188   -0.2788   -0.3067
p = 1   -0.2174    -0.2425   -0.2433   -0.3446*  -0.2991
p = 2   -0.2721    -0.2639   -0.2613   -0.3144   -0.2832
p = 3   -0.2616    -0.2276   -0.2780   -0.2663   -0.2469
p = 4   -0.2186    -0.1990   -0.2291   -0.2574   -0.2099

Minimum marked with * (bold in the original)


The inverted roots of the AR- and the MA-polynomial are plotted together with their corresponding 95 % confidence regions in Fig. 5.4.9 As the confidence regions are all inside the unit circle, the ARMA(1,3) model also has a stationary and causal representation. Moreover, the estimated process is also invertible. In addition, the roots of the AR- and the MA-polynomial are distinct.


9The  confidence  regions  are  determined  by  the  delta-method  (see  Appendix  E). 
Fig. 5.4 Inverted roots of the AR- and the MA-polynomial of the ARMA(1,3) model together with the corresponding 95 % confidence regions

Table 5.2 Values of Bayes’ information criterion (BIC) for alternative ARMA(p,q) models

          q = 0     q = 1     q = 2     q = 3     q = 4
p = 0               0.3297    0.0740   -0.1961   -0.1963
p = 1   -0.1896    -0.1869   -0.1600   -0.2335*  -0.1603
p = 2   -0.2162    -0.1801   -0.1495   -0.1746   -0.1154
p = 3   -0.1772    -0.1150   -0.1373   -0.0974   -0.0499
p = 4   -0.1052    -0.0573   -0.0591   -0.0590    0.0169

Minimum marked with * (bold in the original)


The autocorrelation functions of the residuals of the AR(2) and the ARMA(1,3) model are plotted in Fig. 5.5. They show no sign of significant autocorrelations so that both residual series are practically white noise. We can examine this hypothesis formally by the Ljung-Box test (see Sect. 4.2 Eq. (4.4)). Taking N = 20 the values of the test statistics are $Q_{AR(2)} = 33.80$ and $Q_{ARMA(1,3)} = 21.70$, respectively. The 5 % critical value according to the $\chi^2_{20}$ distribution is 31.41. Thus the null hypothesis $\rho(1) = \ldots = \rho(20) = 0$ is rejected for the AR(2) model, but not for the ARMA(1,3) model. This implies that the AR(2) model does not capture the full dynamics of the data.
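The comparison with the critical value can be complemented by p-values. For an even number of degrees of freedom the $\chi^2$ survival function has a closed form as a finite Poisson sum, so no statistical library is needed. The function name is an illustrative assumption, and the degrees of freedom follow the text (20, without any adjustment for estimated parameters).

```python
import math

def chi2_sf_even_df(x, df):
    """P(chi2_df > x) for even df: exp(-x/2) * sum_{k < df/2} (x/2)^k / k!."""
    if df % 2 != 0 or df <= 0:
        raise ValueError("this closed form requires a positive even df")
    half = x / 2.0
    term, total = 1.0, 1.0
    for k in range(1, df // 2):
        term *= half / k       # builds (x/2)^k / k! recursively
        total += term
    return math.exp(-half) * total

p_ar2 = chi2_sf_even_df(33.80, 20)       # test statistic of the AR(2) residuals
p_arma13 = chi2_sf_even_df(21.70, 20)    # test statistic of the ARMA(1,3) residuals
```

A p-value below 0.05 corresponds to a test statistic above the critical value of 31.41, so the two ways of reading the test lead to the same decision.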

Although the AR(2) and the ARMA(1,3) model seem to be quite different at first glance, they deliver similar impulse response functions as can be gathered from Fig. 5.6. In both models, the impact of the initial shock is first built up to values higher than 1.1 in quarters one and two, respectively. Then the effect monotonically declines to zero. After 10 to 12 quarters the effect of the shock has practically dissipated.

Fig. 5.5 Autocorrelation function (ACF) of the residuals from the AR(2) and the ARMA(1,3) model

Fig. 5.6 Impulse responses of the AR(2) and the ARMA(1,3) model

Fig. 5.7 Forecasts of real GDP growth rates for Switzerland (horizontal axis: 2001Q1 to 2006Q1)

As a final exercise, we use both models to forecast real GDP growth over the next nine quarters, i.e. for the period from the fourth quarter of 2003 to the fourth quarter of 2005. As can be seen from Fig. 5.7, both models predict that the Swiss economy should move out of recession in the coming quarters. However, the ARMA(1,3) model indicates that the recovery takes place more quickly, with growth overshooting its long-run mean of 1.3 % in about a year. The forecast of the AR(2) model predicts a steadier approach to the long-run mean.
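The long-run mean implied by each estimated equation is the constant divided by $1 - \sum_k \phi_k$, and iterating the AR(2) forecast recursion shows the convergence towards it. The function names and the two starting values below are illustrative assumptions; the starting values are not the actual last observations.

```python
def long_run_mean(constant, phi):
    """Unconditional mean c / (1 - sum of AR coefficients) of a causal ARMA model."""
    return constant / (1.0 - sum(phi))

def ar2_forecasts(constant, phi1, phi2, last, second_last, horizon):
    """Iterate x_hat(h) = c + phi1 * x_hat(h-1) + phi2 * x_hat(h-2)."""
    path = [second_last, last]
    for _ in range(horizon):
        path.append(constant + phi1 * path[-1] + phi2 * path[-2])
    return path[2:]

mean_ar2 = long_run_mean(0.218, [1.134, -0.310])
mean_arma13 = long_run_mean(0.6354, [0.527])

# Hypothetical starting values below the mean, mimicking a recession
forecasts = ar2_forecasts(0.218, 1.134, -0.310, last=-0.5, second_last=-0.8, horizon=9)
```

With the rounded coefficients the implied means are about 1.24 for the AR(2) and 1.34 for the ARMA(1,3) model, both close to the long-run mean of 1.3 % mentioned above.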


6 Spectral Analysis and Linear Filters


Up to now we have viewed a time series as a time-indexed sequence of random variables. The class of ARMA processes was seen as an adequate class of models for the analysis of stationary time series. This approach is usually termed time series analysis in the time domain. There is, however, an equivalent perspective which views a time series as superimposed waves of different frequencies. This viewpoint is termed the analysis in the frequency domain. The decomposition of a time series into sinusoids of different frequencies is called the spectral representation. The estimation of the importance of the waves at particular frequencies is referred to as spectral or spectrum estimation. Priestley (1981) provides an excellent account of these methods. The use of frequency domain methods, in particular spectrum estimation, which originated in the natural sciences, was introduced to economics by Granger (1964).1 Notably, he showed that most of the fluctuations in economic time series can be attributed to low-frequency cycles (Granger 1966).

Although both approaches are equivalent, the analysis in the frequency domain is more convenient when it comes to the analysis and construction of linear filters. The application of a filter to a time series amounts to taking some moving average of the time series. These moving averages may extend, at least in theory, into the infinite past, but also into the infinite future. A causal ARMA process $\{X_t\}$ may be regarded as a filtered white-noise process with filter weights given by $\psi_j$, $j = 0, 1, 2, \ldots$ In economics, filters are usually applied to remove cycles of a particular frequency, like seasonal cycles (for example Christmas sales in a store), or to highlight particular cycles, like business cycles.


1 The use of spectral methods in the natural sciences can be traced many centuries back. The modern statistical approach builds on the work of N. Wiener, G. U. Yule, J. W. Tukey, and many others. See the interesting survey by Robinson (1982).
From a mathematical point of view, the equivalence between time and frequency domain analysis rests on the theory of Fourier series. An adequate presentation of this theory is beyond the scope of this book. The interested reader may consult Brockwell and Davis (1991, chapter 4). An introduction to the underlying mathematical theory can be found in standard textbooks like Rudin (1987).


6.1  The  Spectral  Density 

In the following, we assume that $\{X_t\}$ is a mean-zero (centered) stationary stochastic process with autocovariance function $\gamma(h)$, $h = 0, \pm 1, \pm 2, \ldots$ Mathematically, $\gamma(h)$ represents a doubly infinite sequence which can be mapped into a real valued function $f(\lambda)$, $\lambda \in \mathbb{R}$, by the Fourier transform. This function is called the spectral density function or spectral density. Conversely, we can retrieve each covariance from the spectral density. Thus, we have a one-to-one relation between autocovariance functions and spectral densities: both objects summarize the same properties of the time series, but represent them differently.

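Definition 6.1 (Spectral Density). Let $\{X_t\}$ be a mean-zero stationary stochastic process with absolutely summable autocovariance function $\gamma$. Then the function

$$f(\lambda) = \frac{1}{2\pi}\sum_{h=-\infty}^{\infty} \gamma(h)e^{-ih\lambda}, \quad -\infty < \lambda < \infty, \tag{6.1}$$

is called the spectral density function or spectral density of $\{X_t\}$. Thereby $i$ denotes the imaginary unit (see Appendix A).

The sine is an odd function whereas the cosine and the autocovariance function are even functions.2 This implies that the spectral density can be rewritten as:

$$f(\lambda) = \frac{1}{2\pi}\sum_{h=-\infty}^{\infty} \gamma(h)\left(\cos(h\lambda) - i\sin(h\lambda)\right) = \frac{1}{2\pi}\sum_{h=-\infty}^{\infty} \gamma(h)\cos(h\lambda) = \frac{\gamma(0)}{2\pi} + \frac{1}{\pi}\sum_{h=1}^{\infty} \gamma(h)\cos(h\lambda). \tag{6.2}$$

2 A function $f$ is called even if $f(-x) = f(x)$; the function is called odd if $f(-x) = -f(x)$. Thus, we have $\sin(-\theta) = -\sin(\theta)$ and $\cos(-\theta) = \cos(\theta)$.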
Because of the periodicity of the cosine function, i.e. because

$$f(\lambda + 2k\pi) = f(\lambda), \quad \text{for all } k \in \mathbb{Z},$$

it is sufficient to consider the spectral density only in the interval $(-\pi, \pi]$. As the cosine is an even function so is $f$. Thus, we restrict the analysis of the spectral density $f(\lambda)$ further to the domain $\lambda \in [0, \pi]$.

In practice, we often use the period or oscillation length instead of the frequency $\lambda$ measured in radians. They are related by the formula:

$$\text{period length} = \frac{2\pi}{\lambda}. \tag{6.3}$$

If, for example, the data are quarterly observations, a value of 0.3 for $\lambda$ corresponds to a period length of approximately 21 quarters.
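The conversion in Eq. (6.3) and its inverse are easily computed. The function names below are illustrative assumptions.

```python
import math

def period_length(lam):
    """Period length (in observation intervals) of a cycle with frequency lam in radians."""
    return 2.0 * math.pi / lam

def frequency(period):
    """Frequency in radians of a cycle with the given period length."""
    return 2.0 * math.pi / period

quarters = period_length(0.3)      # roughly 21 quarters for quarterly data
annual_cycle = frequency(4.0)      # a yearly cycle in quarterly data sits at pi/2
shortest = period_length(math.pi)  # the highest frequency pi is a cycle of two periods
```

Since the analysis is restricted to $\lambda \in [0, \pi]$, the shortest cycle that can be distinguished in the data has a length of two observation intervals.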


Remark 6.1. We gather some properties of the spectral density function $f$:

• Because $f(0) = \frac{1}{2\pi}\sum_{h=-\infty}^{\infty} \gamma(h)$, the long-run variance $J$ of $\{X_t\}$ (see Sect. 4.4) equals $2\pi f(0)$, i.e. $2\pi f(0) = J$.

• $f$ is an even function so that $f(\lambda) = f(-\lambda)$.

• $f(\lambda) \ge 0$ for all $\lambda \in (-\pi, \pi]$. The proof of this proposition can be found in Brockwell and Davis (1996, chapter 4). This property corresponds to the non-negative definiteness of the autocovariance function (see property 4 in Theorem 1.1 of Sect. 1.3).

• The single autocovariances are the Fourier-coefficients of the spectral density $f$:

$$\gamma(h) = \int_{-\pi}^{\pi} e^{ih\lambda} f(\lambda)\,d\lambda = \int_{-\pi}^{\pi} \cos(h\lambda) f(\lambda)\,d\lambda.$$

For $h = 0$, we therefore get $\gamma(0) = \int_{-\pi}^{\pi} f(\lambda)\,d\lambda$.

The last property allows us to compute the autocovariances from a given spectral density. It shows how time and frequency domain analysis are related to each other and how a property in one domain is reflected as a property in the other.
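The two directions of this relation can be illustrated numerically. The sketch below evaluates the spectral density from a finite list of autocovariances via the cosine form (6.2) and recovers the autocovariances by numerical integration. The function names, the midpoint rule, and the autocovariance values are illustrative assumptions.

```python
import math

def spectral_density(gamma, lam):
    """f(lam) for autocovariances gamma = [gamma(0), ..., gamma(H)], zero beyond lag H."""
    tail = sum(g * math.cos(h * lam) for h, g in enumerate(gamma) if h > 0)
    return (gamma[0] + 2.0 * tail) / (2.0 * math.pi)

def autocovariance(f, h, n=2000):
    """Midpoint-rule approximation of the integral of cos(h*lam) * f(lam) over (-pi, pi)."""
    step = 2.0 * math.pi / n
    grid = (-math.pi + (k + 0.5) * step for k in range(n))
    return step * sum(math.cos(h * lam) * f(lam) for lam in grid)

# Hypothetical autocovariances: gamma(0) = 1, gamma(1) = 0.4, zero otherwise
gamma = [1.0, 0.4]
f = lambda lam: spectral_density(gamma, lam)

recovered = [autocovariance(f, h) for h in range(3)]
long_run_variance = 2.0 * math.pi * f(0.0)   # equals gamma(0) + 2 * gamma(1)
```

Replacing the first-order autocovariance by a value above 0.5 in absolute terms makes the function negative at some frequencies, so that it can no longer be a spectral density.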

These properties of a non-negative definite function can be used to characterize the spectral density of a stationary process $\{X_t\}$ with autocovariance function $\gamma$.

Theorem 6.1 (Properties of a Spectral Density). A function $f$ defined on $(-\pi, \pi]$ is the spectral density of a stationary process if and only if the following properties hold:

• $f(\lambda) = f(-\lambda)$;
• $f(\lambda) \ge 0$;
• $\int_{-\pi}^{\pi} f(\lambda)\,d\lambda < \infty$.
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Corollary  6.1.  An  absolutely  summable  function  y  is  the  autocovariance  function 
of  a  stationary  process  if  and  only  if 


/(A)  =  —  Y  y(h)e~,hX  >  0,  for  all  X  e  (— jr,  it], 

h=—oo 

In  this  case  f  is  called  the  spectral  density  of  y. 


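The function $f(\lambda)/\gamma(0)$ can be considered as a density function of some probability distribution defined on $[-\pi, \pi]$ because $f(\lambda)/\gamma(0) \ge 0$ and

\[ \int_{-\pi}^{\pi} \frac{f(\lambda)}{\gamma(0)}\,d\lambda = 1. \]

The corresponding cumulative distribution function $G$ is then defined as:

\[ G(\lambda) = \int_{-\pi}^{\lambda} \frac{f(\omega)}{\gamma(0)}\,d\omega, \quad -\pi \le \lambda \le \pi. \]

It satisfies: $G(-\pi) = 0$, $G(\pi) = 1$, $1 - G(\lambda) = G(-\lambda)$, and $G(0) = 1/2$. The autocorrelation function $\rho$ is then given by

\[ \rho(h) = \frac{\gamma(h)}{\gamma(0)} = \int_{-\pi}^{\pi} e^{ih\lambda}\,dG(\lambda). \]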


Some  Examples 


Some  relevant  examples  illustrating  the  above  are: 

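white noise: Let $\{X_t\}$ be a white noise process with $X_t \sim WN(0, \sigma^2)$. For this process all autocovariances, except $\gamma(0)$, are equal to zero. The spectral density therefore is equal to

\[ f(\lambda) = \frac{1}{2\pi}\sum_{h=-\infty}^{\infty}\gamma(h)e^{-ih\lambda} = \frac{\gamma(0)}{2\pi} = \frac{\sigma^2}{2\pi}. \]
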

Thus,  the  spectral  density  is  equal  to  a  constant  which  is  proportional  to  the 
variance.  This  means  that  no  particular  frequency  dominates  the  spectral  density. 
This  is  the  reason  why  such  a  process  is  called  white  noise. 

MA(1): Let $\{X_t\}$ be a MA(1) process with autocovariance function

\[ \gamma(h) = \begin{cases} 1, & h = 0; \\ \rho, & h = \pm 1; \\ 0, & \text{otherwise.} \end{cases} \]

The spectral density therefore is:

\[ f(\lambda) = \frac{1}{2\pi}\sum_{h=-\infty}^{\infty}\gamma(h)e^{-ih\lambda} = \frac{\rho e^{i\lambda} + 1 + \rho e^{-i\lambda}}{2\pi} = \frac{1 + 2\rho\cos\lambda}{2\pi}. \]

Thus, $f(\lambda) \ge 0$ if and only if $|\rho| \le 1/2$. According to Corollary 6.1 above, $\gamma$ is the autocovariance function of a stationary stochastic process if and only if $|\rho| \le 1/2$. This condition corresponds exactly to the condition derived in the time domain (see Sect. 1.3). The spectral density for $\rho = 0.4$ or equivalently $\theta = 0.5$, respectively for $\rho = -0.4$ or equivalently $\theta = -0.5$, and $\sigma^2 = 1$ is plotted in Fig. 6.1a. As the process is rather smooth when the first order autocorrelation is positive, the spectral density is large in the neighborhood of zero and small in the neighborhood of $\pi$. For a negative autocorrelation the picture is just reversed.
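
The MA(1) example lends itself to a small numerical check. The following sketch, which assumes the normalization $\gamma(0) = 1$, $\gamma(\pm 1) = \rho$ used above, evaluates the spectral density on a grid to test non-negativity and recovers the autocovariances as Fourier coefficients by numerical integration. All function names are illustrative.

```python
import math


def ma1_spectral_density(lam: float, rho: float) -> float:
    """f(lam) = (1 + 2*rho*cos(lam)) / (2*pi) for gamma(0) = 1 and gamma(+-1) = rho."""
    return (1.0 + 2.0 * rho * math.cos(lam)) / (2.0 * math.pi)


def autocovariance_from_density(h: int, rho: float, n: int = 20000) -> float:
    """Approximate gamma(h) = int_{-pi}^{pi} cos(h*lam) f(lam) dlam by the midpoint rule."""
    step = 2.0 * math.pi / n
    total = 0.0
    for i in range(n):
        lam = -math.pi + (i + 0.5) * step
        total += math.cos(h * lam) * ma1_spectral_density(lam, rho)
    return total * step


def is_nonnegative(rho: float, n: int = 2001) -> bool:
    """Check f(lam) >= 0 on an equally spaced grid over [0, pi] (small rounding tolerance)."""
    return all(
        ma1_spectral_density(math.pi * i / (n - 1), rho) >= -1e-12 for i in range(n)
    )


for rho in (0.4, 0.5, 0.6):
    gammas = [round(autocovariance_from_density(h, rho), 6) for h in range(3)]
    print(rho, is_nonnegative(rho), gammas)
```

For $\rho = 0.6$ the function becomes negative near $\lambda = \pi$, so it cannot be the spectral density of a stationary process, in line with Corollary 6.1.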

AR(1): The spectral density of an AR(1) process $X_t = \phi X_{t-1} + Z_t$ with $Z_t \sim WN(0, \sigma^2)$ is:

\[ f(\lambda) = \frac{\gamma(0)}{2\pi}\left(1 + \sum_{h=1}^{\infty}\phi^h\left(e^{-ih\lambda} + e^{ih\lambda}\right)\right) = \frac{\sigma^2}{2\pi(1 - \phi^2)}\left(1 + \frac{\phi e^{i\lambda}}{1 - \phi e^{i\lambda}} + \frac{\phi e^{-i\lambda}}{1 - \phi e^{-i\lambda}}\right) = \frac{\sigma^2}{2\pi}\,\frac{1}{1 - 2\phi\cos\lambda + \phi^2}. \]

The spectral densities for $\phi = 0.6$ and $\phi = -0.6$ with $\sigma^2 = 1$ are plotted in Fig. 6.1b. As the process with $\phi = 0.6$ exhibits a relatively large positive autocorrelation so that it is rather smooth, the spectral density takes large values for low frequencies. In contrast, the process with $\phi = -0.6$ is rather volatile due to the negative first order autocorrelation. Thus, high frequencies are more important than low frequencies as reflected in the corresponding figure.

Note that, as $\phi$ approaches one, the spectral density evaluated at zero tends to infinity, i.e. $f(0) = \frac{\sigma^2}{2\pi(1 - \phi)^2} \to \infty$ for $\phi \uparrow 1$. This can be interpreted in the following way. As the process gets closer to a random walk, more and more weight is given to long-run fluctuations (cycles with very low frequency or very high periodicity) (Granger 1966).
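
As a hedged illustration of the AR(1) case, the sketch below compares the closed-form spectral density with a truncated version of the defining sum over the autocovariances $\gamma(h) = \gamma(0)\phi^{|h|}$. The truncation lag of 500 and the function names are assumptions made for this example only.

```python
import math


def ar1_density(lam: float, phi: float, sigma2: float = 1.0) -> float:
    """Closed form sigma^2 / (2*pi*(1 - 2*phi*cos(lam) + phi^2))."""
    return sigma2 / (2.0 * math.pi * (1.0 - 2.0 * phi * math.cos(lam) + phi ** 2))


def ar1_density_from_sum(lam: float, phi: float, sigma2: float = 1.0, max_lag: int = 500) -> float:
    """Truncated sum (1/(2*pi)) * sum_h gamma(h) * exp(-i*h*lam) with gamma(h) = gamma(0) * phi^|h|."""
    gamma0 = sigma2 / (1.0 - phi ** 2)
    series = 1.0 + sum(2.0 * phi ** h * math.cos(h * lam) for h in range(1, max_lag + 1))
    return gamma0 * series / (2.0 * math.pi)


for phi in (0.6, -0.6):
    low, high = ar1_density(0.0, phi), ar1_density(math.pi, phi)
    print(f"phi={phi:+.1f}: f(0)={low:.4f}, f(pi)={high:.4f}")

# f(0) = sigma^2 / (2*pi*(1 - phi)^2) grows without bound as phi approaches one.
for phi in (0.9, 0.99, 0.999):
    print(phi, ar1_density(0.0, phi))
```

The first loop reproduces the qualitative pattern described above: for positive $\phi$ the mass sits at low frequencies, for negative $\phi$ at high frequencies.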


6.2  Spectral  Decomposition  of  a  Time  Series 

Consider the simple harmonic process $\{X_t\}$ which just consists of a cosine and a sine wave:

\[ X_t = A\cos(\omega t) + B\sin(\omega t). \tag{6.4} \]

Fig. 6.1 Examples of spectral densities with $Z_t \sim WN(0, 1)$. (a) MA(1) process, (b) AR(1) process


Thereby $A$ and $B$ are two uncorrelated random variables with $EA = EB = 0$ and $VA = VB = 1$. The autocovariance function of this process is $\gamma(h) = \cos(\omega h)$. This autocovariance function cannot be represented as $\int_{-\pi}^{\pi} e^{ih\lambda} f(\lambda)\,d\lambda$. However, it can be regarded as the Fourier transform of a discrete distribution function $F$:

\[ \gamma(h) = \cos(\omega h) = \int_{(-\pi, \pi]} e^{ih\lambda}\,dF(\lambda), \]

where

\[ F(\lambda) = \begin{cases} 0, & \text{for } \lambda < -\omega; \\ 1/2, & \text{for } -\omega \le \lambda < \omega; \\ 1, & \text{for } \lambda \ge \omega. \end{cases} \tag{6.5} \]

The integral with respect to the discrete distribution function is a so-called Riemann-Stieltjes integral.³ $F$ is a step function with jumps at $-\omega$ and $\omega$ and step size of 1/2 so that the above integral equals $\frac{1}{2}e^{-ih\omega} + \frac{1}{2}e^{ih\omega} = \cos(h\omega)$.
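
The autocovariance function of the simple harmonic process stated above can be verified directly. The following short derivation uses only $EA = EB = 0$, $VA = VB = 1$, the uncorrelatedness of $A$ and $B$, and the addition theorem for the cosine:

\[
\begin{aligned}
\gamma(h) &= E\left[\left(A\cos(\omega(t+h)) + B\sin(\omega(t+h))\right)\left(A\cos(\omega t) + B\sin(\omega t)\right)\right] \\
&= EA^2\cos(\omega(t+h))\cos(\omega t) + EB^2\sin(\omega(t+h))\sin(\omega t) \\
&= \cos(\omega(t+h) - \omega t) = \cos(\omega h).
\end{aligned}
\]

As the result does not depend on $t$ and $EX_t = 0$, the process is stationary.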

These considerations lead to a representation, called the spectral representation, of the autocovariance function as the Fourier transform of a distribution function over $[-\pi, \pi]$.


³ The Riemann-Stieltjes integral is a generalization of the Riemann integral. Let $f$ and $g$ be two bounded functions defined on the interval $[a, b]$; then the Riemann-Stieltjes integral $\int_a^b f(x)\,dg(x)$ is defined as $\lim_{n\to\infty}\sum_{i=1}^{n} f(\xi_i)[g(x_i) - g(x_{i-1})]$ where $a = x_0 < x_1 < \dots < x_{n-1} < x_n = b$ and $\xi_i \in [x_{i-1}, x_i]$. For $g(x) = x$ we obtain the standard Riemann integral. If $g$ is a step function with a countable number of steps $x_i$ of height $h_i$ then $\int_a^b f(x)\,dg(x) = \sum_i f(x_i)h_i$.

Theorem 6.2 (Spectral Representation). $\gamma$ is the autocovariance function of a stationary process $\{X_t\}$ if and only if there exists a right-continuous, nondecreasing, bounded function $F$ on $(-\pi, \pi]$ with the properties $F(-\pi) = 0$ and

\[ \gamma(h) = \int_{(-\pi, \pi]} e^{ih\lambda}\,dF(\lambda). \tag{6.6} \]

$F$ is called the spectral distribution function of $\gamma$.

Remark 6.2. If the spectral distribution function $F$ has a density $f$ such that $F(\lambda) = \int_{-\pi}^{\lambda} f(\omega)\,d\omega$ then $f$ is called the spectral density and the time series is said to have a continuous spectrum.


Remark 6.3. According to the Lebesgue-Radon-Nikodym Theorem (see, for example, Rudin (1987)), the spectral distribution function $F$ can be represented uniquely as the sum of a distribution function $F_Z$ which is absolutely continuous with respect to the Lebesgue measure and a discrete distribution function $F_V$. The distribution function $F_Z$ corresponds to the regular part of the Wold Decomposition (see Theorem 3.1 in Sect. 3.2) and has spectral density

\[ f_Z(\lambda) = \frac{\sigma^2}{2\pi}\left|\Psi(e^{-i\lambda})\right|^2 = \frac{\sigma^2}{2\pi}\left|\sum_{j=0}^{\infty}\psi_j e^{-ij\lambda}\right|^2. \]

The discrete distribution $F_V$ corresponds to the deterministic part $\{V_t\}$.


The process (6.4) considers just a single frequency $\omega$. We may, however, generalize this process by superposing several sinusoids. This leads to the class of harmonic processes:

\[ X_t = \sum_{j=1}^{k} A_j\cos(\omega_j t) + B_j\sin(\omega_j t), \quad 0 < \omega_1 < \dots < \omega_k < \pi \tag{6.7} \]

where $A_1, B_1, \dots, A_k, B_k$ are random variables which are uncorrelated with each other and which have means $EA_j = EB_j = 0$ and variances $VA_j = VB_j = \sigma_j^2$, $j = 1, \dots, k$. The autocovariance function of such a process is given by $\gamma(h) = \sum_{j=1}^{k}\sigma_j^2\cos(\omega_j h)$. According to the spectral representation theorem the corresponding distribution function $F$ can be represented as a weighted sum of distribution functions like those in Eq. (6.5):

\[ F(\lambda) = \sum_{j=1}^{k}\sigma_j^2 F_j(\lambda) \]

with

\[ F_j(\lambda) = \begin{cases} 0, & \text{for } \lambda < -\omega_j; \\ 1/2, & \text{for } -\omega_j \le \lambda < \omega_j; \\ 1, & \text{for } \lambda \ge \omega_j. \end{cases} \]

This  generalization  points  to  the  following  properties: 


• Each of the $k$ components $A_j\cos(\omega_j t) + B_j\sin(\omega_j t)$, $j = 1, \dots, k$, is completely associated to a specific frequency $\omega_j$.

• The $k$ components are uncorrelated with each other.

• The variance of each component is $\sigma_j^2$. The contribution of each component to the variance of $X_t$, given by $\sum_{j=1}^{k}\sigma_j^2$, therefore is $\sigma_j^2$.

• $F$ is a nondecreasing step-function with jumps at frequencies $\omega = \pm\omega_j$ and step sizes $\frac{1}{2}\sigma_j^2$.

• The corresponding probability distribution is discrete with values $\frac{1}{2}\sigma_j^2$ at the frequencies $\omega = \pm\omega_j$ and zero otherwise.


The  interesting  feature  of  harmonic  processes  as  represented  in  Eq.  (6.7)  is  that 
every  stationary  process  can  be  represented  as  the  superposition  of  uncorrelated 
sinusoids.  However,  in  general  infinitely  many  (even  uncountably  many)  of  these 
processes  have  to  be  superimposed.  The  generalization  of  (6.7)  then  leads  to  the 
spectral  representation  of  a  stationary  stochastic  process: 

\[ X_t = \int_{(-\pi, \pi]} e^{it\lambda}\,dZ(\lambda). \tag{6.8} \]

Thereby $\{Z(\lambda)\}$ is a complex-valued stochastic process with uncorrelated increments defined on the interval $(-\pi, \pi]$. The above representation is known as the spectral representation of the process $\{X_t\}$.⁴ Note the analogy to the spectral representation of the autocovariance function in Eq. (6.6).

For the harmonic processes in Eq. (6.7), we have:

\[ dZ(\lambda) = \begin{cases} \frac{A_j + iB_j}{2}, & \text{for } \lambda = -\omega_j \text{ and } j = 1, 2, \dots, k; \\ \frac{A_j - iB_j}{2}, & \text{for } \lambda = \omega_j \text{ and } j = 1, 2, \dots, k; \\ 0, & \text{otherwise.} \end{cases} \]

In this case the variance of $dZ$ is given by:

\[ E\,dZ(\lambda)\overline{dZ(\lambda)} = \begin{cases} \frac{\sigma_j^2}{2}, & \text{if } \lambda = \pm\omega_j; \\ 0, & \text{otherwise.} \end{cases} \]

In general, we have:

\[ E\,dZ(\lambda)\overline{dZ(\lambda)} = \begin{cases} F(\lambda) - F(\lambda^-), & \text{discrete spectrum;} \\ f(\lambda)\,d\lambda, & \text{continuous spectrum.} \end{cases} \]

Thus, a large jump of the spectrum at frequency $\lambda$ is associated with a large sinusoidal component with frequency $\lambda$.⁵

⁴ A mathematically precise statement is given in Brockwell and Davis (1991, chapter 4) where also the notion of stochastic integration is explained.


6.3  The  Periodogram  and  the  Estimation  of  Spectral  Densities 

Although the spectral distribution function is uniquely determined, its estimation from a finite sample with realizations $\{x_1, x_2, \dots, x_T\}$ is not easy. This has to do with the problem of estimating a function from a finite number of points. We will present two approaches: a non-parametric and a parametric one.


6.3.1  Non-Parametric  Estimation 


A simple estimator of the spectral density, $\hat f_T(\lambda)$, can be obtained by replacing in the defining equation (6.1) the theoretical autocovariances $\gamma$ by their estimates $\hat\gamma$. However, instead of a simple sum, we consider a weighted sum:

\[ \hat f_T(\lambda) = \frac{1}{2\pi}\sum_{|h| \le \ell_T} k\left(\frac{h}{\ell_T}\right)\hat\gamma(h)e^{-ih\lambda}. \tag{6.9} \]

The weighting function $k$, also known as the lag window, is assumed to have exactly the same properties as the kernel function introduced in Sect. 4.4. This correspondence is not accidental; indeed, the long-run variance defined in Eq. (4.1) is just $2\pi$ times the spectral density evaluated at $\lambda = 0$. Thus, one might choose a weighting, kernel or lag window from Table 4.1, like the Bartlett window, and use it to estimate the spectral density. The lag truncation parameter is chosen in such a way that $\ell_T \to \infty$ as $T \to \infty$. The rate of divergence should, however, be smaller than $T$ so that $\ell_T/T$ approaches zero as $T$ goes to infinity. As an estimator of the autocovariances one uses the estimator given in Eq. (4.2) of Sect. 4.2.
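
A minimal sketch of the indirect estimator with a Bartlett lag window may help to fix ideas. It assumes that the autocovariance estimator divides by $T$ and subtracts the sample mean, and it uses $\ell_T = \lfloor\sqrt{T}\rfloor$ as one possible truncation rule; both choices, as well as all function names, are assumptions of this example rather than prescriptions from the text.

```python
import math
import random


def sample_autocovariance(x, h):
    """gamma_hat(h) = (1/T) * sum_t (x_t - mean) * (x_{t+|h|} - mean)."""
    T = len(x)
    mean = sum(x) / T
    h = abs(h)
    return sum((x[t] - mean) * (x[t + h] - mean) for t in range(T - h)) / T


def bartlett(u):
    """Bartlett (triangular) lag window: 1 - |u| for |u| <= 1 and zero otherwise."""
    return max(0.0, 1.0 - abs(u))


def lag_window_estimate(x, lam, ell):
    """(1/(2*pi)) * sum_{|h| <= ell} k(h/ell) * gamma_hat(h) * exp(-i*h*lam).

    Because k and gamma_hat are even in h, the sum is real and exp(-i*h*lam)
    can be replaced by cos(h*lam).
    """
    total = 0.0
    for h in range(-ell, ell + 1):
        total += bartlett(h / ell) * sample_autocovariance(x, h) * math.cos(h * lam)
    return total / (2.0 * math.pi)


random.seed(0)
T = 400
x = [random.gauss(0.0, 1.0) for _ in range(T)]  # simulated white noise, true density 1/(2*pi)
ell = int(math.sqrt(T))                          # ell grows with T while ell/T shrinks
for lam in (0.0, 1.0, 2.0, 3.0):
    print(f"lam={lam:.1f}: estimate={lag_window_estimate(x, lam, ell):.4f}")
print(f"true value: {1.0 / (2.0 * math.pi):.4f}")
```

With the Bartlett window the estimate cannot become negative, which is one reason for its popularity.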

The above estimator is called an indirect spectral estimator because it requires the estimation of the autocovariances in the first step. The periodogram provides an alternative direct spectral estimator. For this purpose, we represent the observations as linear combinations of sinusoids of specific frequencies. These so-called Fourier frequencies are defined as $\omega_k = \frac{2\pi k}{T}$, $k = -\lfloor\frac{T-1}{2}\rfloor, \dots, \lfloor\frac{T}{2}\rfloor$. Thereby $\lfloor x\rfloor$ denotes the largest integer smaller or equal to $x$. With this notation, the observations $x_t$, $t = 1, \dots, T$, can be represented as a sum of sinusoids:

\[ x_t = \sum_{k=-\lfloor\frac{T-1}{2}\rfloor}^{\lfloor\frac{T}{2}\rfloor} a_k e^{i\omega_k t} = \sum_{k=-\lfloor\frac{T-1}{2}\rfloor}^{\lfloor\frac{T}{2}\rfloor} a_k\left(\cos(\omega_k t) + i\sin(\omega_k t)\right). \]

⁵ Thereby $F(\lambda^-)$ denotes the left-sided limit, i.e. $F(\lambda^-) = \lim_{\omega\uparrow\lambda}F(\omega)$.

The coefficients $\{a_k\}$ are the discrete Fourier-transform of the observations $\{x_1, x_2, \dots, x_T\}$. The periodogram $I_T$ is then defined as follows.

Definition 6.2 (Periodogram). Given observations $\{x_1, x_2, \dots, x_T\}$, the periodogram is defined as the function

\[ I_T(\lambda) = \frac{1}{T}\left|\sum_{t=1}^{T} x_t e^{-it\lambda}\right|^2. \]


For each Fourier frequency $\omega_k$, the periodogram $I_T(\omega_k)$ equals $|a_k|^2$. This implies that

\[ \sum_{t=1}^{T}|x_t|^2 = \sum_{k=-\lfloor\frac{T-1}{2}\rfloor}^{\lfloor\frac{T}{2}\rfloor}|a_k|^2 = \sum_{k=-\lfloor\frac{T-1}{2}\rfloor}^{\lfloor\frac{T}{2}\rfloor} I_T(\omega_k). \]

The value of the periodogram evaluated at the Fourier frequency $\omega_k$ is therefore nothing but the contribution of the sinusoid with frequency $\omega_k$ to the variation of $\{x_t\}$ as measured by the sum of squares. In particular, for any Fourier frequency different from zero we have that

\[ I_T(\omega_k) = \sum_{h=-T+1}^{T-1}\hat\gamma(h)e^{-ih\omega_k}. \]

Thus the periodogram represents, disregarding the proportionality factor $2\pi$, the sample analogue of the spectral density and therefore carries the same information.
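
The two identities for the periodogram can be checked numerically. The sketch below evaluates the periodogram by a direct Fourier sum (an FFT routine would normally be used for long series) and assumes the mean-corrected autocovariance estimator with divisor $T$; the simulated data and all names are illustrative.

```python
import cmath
import math
import random


def periodogram(x, lam):
    """I_T(lam) = (1/T) * |sum_{t=1}^T x_t * exp(-i*t*lam)|^2, by a direct sum."""
    T = len(x)
    dft = sum(x[t - 1] * cmath.exp(-1j * t * lam) for t in range(1, T + 1))
    return abs(dft) ** 2 / T


def sample_autocovariance(x, h):
    """gamma_hat(h) = (1/T) * sum_t (x_t - mean) * (x_{t+|h|} - mean)."""
    T = len(x)
    mean = sum(x) / T
    h = abs(h)
    return sum((x[t] - mean) * (x[t + h] - mean) for t in range(T - h)) / T


def fourier_frequencies(T):
    """omega_k = 2*pi*k/T for k = -floor((T-1)/2), ..., floor(T/2)."""
    return [2.0 * math.pi * k / T for k in range(-((T - 1) // 2), T // 2 + 1)]


random.seed(1)
T = 50
x = [0.5 + random.gauss(0.0, 1.0) for _ in range(T)]
freqs = fourier_frequencies(T)

# Sum-of-squares decomposition: sum_t x_t^2 equals the sum of the periodogram ordinates.
total_variation = sum(v * v for v in x)
sum_of_ordinates = sum(periodogram(x, w) for w in freqs)

# At a nonzero Fourier frequency the periodogram equals the transform of gamma_hat.
w = freqs[-3]
via_autocov = sum(sample_autocovariance(x, h) * math.cos(h * w) for h in range(-T + 1, T))
print(total_variation, sum_of_ordinates)
print(periodogram(x, w), via_autocov)
```

The second identity holds only away from frequency zero because subtracting the sample mean leaves the Fourier sum unchanged exactly at the nonzero Fourier frequencies.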

Unfortunately, it turns out that the periodogram is not a consistent estimator of the spectral density. In particular, the covariance between $I_T(\lambda_1)$ and $I_T(\lambda_2)$, $\lambda_1 \ne \lambda_2$, goes to zero for $T$ going to infinity. The periodogram thus has a tendency to get very jagged for large $T$, leading to the detection of spurious sinusoids. A way out of this problem is to average the periodogram over neighboring frequencies, thereby reducing its variance. This makes sense because the variance is relatively constant within a small frequency band. The averaging (smoothing) of the periodogram over neighboring frequencies leads to the class of discrete spectral average estimators which turn out to be consistent:

\[ \hat f_T(\lambda) = \frac{1}{2\pi}\sum_{|h| \le \ell_T} K_T(h)\,I_T\left(\tilde\omega_{T,\lambda} + \frac{2\pi h}{T}\right) \tag{6.10} \]

where $\tilde\omega_{T,\lambda}$ denotes the multiple of $\frac{2\pi}{T}$ which is closest to $\lambda$. $\ell_T$ is the bandwidth of the estimator, i.e. the number of ordinates over which the average is taken. $\ell_T$ satisfies the same properties as in the case of the indirect spectral estimator (6.9): $\ell_T \to \infty$ and $\ell_T/T \to 0$ for $T \to \infty$. Thus, as $T$ goes to infinity, on the one hand the average is taken over more and more values, but on the other hand the frequency band over which the average is taken is getting smaller and smaller. The spectral weighting function or spectral window $K_T$ is a positive even function satisfying $\sum_{|h| \le \ell_T} K_T(h) = 1$ and $\sum_{|h| \le \ell_T} K_T^2(h) \to 0$ for $T \to \infty$. It can be shown that under these conditions the discrete spectral average estimator is mean-square consistent. Moreover, the estimator in Eq. (6.9) can be approximated by a corresponding discrete spectral average estimator by defining the spectral window as

\[ K_T(\omega) = \frac{1}{2\pi}\sum_{|h| \le \ell_T} k\left(\frac{h}{\ell_T}\right)e^{-ih\omega} \]

or vice versa

\[ k\left(\frac{h}{\ell_T}\right) = \int_{-\pi}^{\pi} K_T(\omega)e^{-ih\omega}\,d\omega. \]

Thus, the lag and the spectral window are related via the Fourier transform. For details and the asymptotic distribution the interested reader is referred to Brockwell and Davis (1991, Chapter 10). Although the indirect and the direct estimator give approximately the same result when the kernels used are related as in the equation above, the direct estimator (6.10) is usually preferred in practice because it is, especially for long time series, computationally more efficient, in particular in connection with the fast Fourier transform (FFT).⁶

A simple spectral weighting function, known as the Daniell spectral window, is given by $K_T(h) = (2\ell_T + 1)^{-1}$ when $|h| \le \ell_T$ and 0 otherwise and where $\ell_T = \sqrt{T}$. It averages over $2\ell_T + 1$ values within a frequency band of approximate width $\frac{4\pi}{\sqrt{T}}$. This function corresponds to the Daniell kernel function or Daniell lag window $k(x) = \sin(\pi x)/(\pi x)$ for $|x| \le 1$ and zero otherwise (see Sect. 4.4). In practice, the
sample  size  is  fixed  and  the  researcher  is  faced  with  a  trade-off  between  variance  and 
bias.  On  the  one  hand,  a  weighting  function  which  averages  over  a  wide  frequency 
band  produces  a  smooth  spectral  density,  but  has  probably  a  large  bias  because  the 
estimate  of /(A)  depends  on  frequencies  which  are  rather  far  away  from  A.  On  the 
other  hand,  a  weighting  function  which  averages  only  over  a  small  frequency  band 
produces a small bias, but probably a large variance. It is thus advisable in practice to work with alternative weighting functions and to choose the one which delivers a satisfying balance between bias and variance.

⁶ The FFT is seen as one of the most important numerical algorithms ever as it allows a rapid computation of the Fourier transform and its inverse. The FFT is widely used in digital signal processing.

Fig. 6.2 Raw periodogram of a white noise time series ($X_t \sim WN(0, 1)$, $T = 200$)

The following two examples demonstrate the large variance of the periodogram. The first example consists of 200 observations from a simulated white noise time series with variance equal to one. Whereas the true spectrum is constant equal to one, the raw periodogram, i.e. the periodogram without smoothing, plotted in Fig. 6.2 is quite erratic. However, it is obvious that by taking averages of adjacent frequencies the periodogram becomes smoother and more in line with the theoretical spectrum. The second example consists of 200 observations of a simulated AR(2) process. Figure 6.3 demonstrates again the jaggedness of the raw periodogram. However, these erratic movements are distributed around the true spectrum. Thus, by smoothing one can hope to get closer to the true spectrum and even detect the dominant cycle with a frequency of about one radian. It is also clear that by smoothing over too large a range, in the extreme over all frequencies, no cycle could be detected.
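
The effect of smoothing can be reproduced with simulated data. The following sketch computes the raw periodogram of a white noise series at the frequencies $2\pi k/T$, $k = 0, \dots, T-1$, and smooths it with equal (Daniell-type) weights over $2\ell_T + 1$ neighboring ordinates. The seed, the choice $\ell_T = \lfloor\sqrt{T}\rfloor$ and the wrap-around treatment at the boundaries are assumptions made for this illustration.

```python
import cmath
import math
import random


def periodogram_ordinates(x):
    """I_T(2*pi*k/T) for k = 0, ..., T-1, computed by a direct O(T^2) Fourier sum."""
    T = len(x)
    ordinates = []
    for k in range(T):
        w = 2.0 * math.pi * k / T
        dft = sum(x[t] * cmath.exp(-1j * w * (t + 1)) for t in range(T))
        ordinates.append(abs(dft) ** 2 / T)
    return ordinates


def daniell_estimate(ordinates, ell):
    """Average 2*ell+1 neighbouring ordinates with equal weights and divide by 2*pi.

    The periodogram is 2*pi-periodic, so indices are wrapped around modulo T.
    """
    T = len(ordinates)
    weight = 1.0 / (2 * ell + 1)
    return [
        sum(weight * ordinates[(k + h) % T] for h in range(-ell, ell + 1)) / (2.0 * math.pi)
        for k in range(T)
    ]


def variance(values):
    """Population variance of a list of numbers."""
    m = sum(values) / len(values)
    return sum((v - m) ** 2 for v in values) / len(values)


random.seed(2)
T = 200
x = [random.gauss(0.0, 1.0) for _ in range(T)]
ordinates = periodogram_ordinates(x)
raw = [v / (2.0 * math.pi) for v in ordinates]
smooth = daniell_estimate(ordinates, ell=int(math.sqrt(T)))
print(f"variance across frequencies: raw={variance(raw):.5f}, smoothed={variance(smooth):.5f}")
```

Both series have the same average level, but the smoothed estimate fluctuates much less around it. Choosing `ell` close to `T // 2` would flatten every peak, which mirrors the warning about smoothing over too wide a band.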

Figure 6.4 illustrates these considerations with real life data by estimating the spectral density of quarterly growth rates of real investment in construction for the Swiss economy using alternative weighting functions. To obtain a better graphical resolution we have plotted the estimates on a logarithmic scale. All three estimates show a peak (local maximum) at the frequency $\lambda = \frac{\pi}{2}$. This corresponds to a wave with a period of one year. The estimator with a comparably wide frequency band (dotted line) smooths the minimum at $\lambda = 1$ away. The estimator with a comparably small frequency band (dashed line), on the contrary, reveals additional waves with frequencies $\lambda = 0.75$ and $0.3$ which correspond to periods of approximately two and five years, respectively. Whether these waves are just artifacts of the weighting function or whether there really exist cycles of that periodicity remains open.

Fig. 6.3 Raw periodogram of an AR(2) process ($X_t = 0.9X_{t-1} - 0.7X_{t-2} + Z_t$ with $Z_t \sim WN(0, 1)$, $T = 200$)

Fig. 6.4 Non-parametric direct estimates of a spectral density with alternative weighting functions


6.3.2  Parametric  Estimation 

An alternative to the nonparametric approaches just outlined consists in estimating an ARMA model and then deducing the spectral density from it. This approach was essentially first proposed by Yule (1927).

Theorem 6.3 (Spectral Density of ARMA Processes). Let $\{X_t\}$ be a causal ARMA(p,q) process given by $\Phi(L)X_t = \Theta(L)Z_t$ and $Z_t \sim WN(0, \sigma^2)$. Then the spectral density $f_X$ is given by

\[ f_X(\lambda) = \frac{\sigma^2}{2\pi}\frac{\left|\Theta(e^{-i\lambda})\right|^2}{\left|\Phi(e^{-i\lambda})\right|^2}, \quad -\pi \le \lambda \le \pi. \tag{6.11} \]

Proof. $\{X_t\}$ is generated by applying the linear filter $\Psi(L)$ with transfer function $\Psi(e^{-i\lambda}) = \frac{\Theta(e^{-i\lambda})}{\Phi(e^{-i\lambda})}$ to $\{Z_t\}$ (see Sect. 6.4). Formula (6.11) is then an immediate consequence of Theorem 6.5 because the spectral density of $\{Z_t\}$ is equal to $\frac{\sigma^2}{2\pi}$. □


Remark 6.4. As the spectral density of an ARMA process $\{X_t\}$ is given by a quotient of trigonometric functions, the process is said to have a rational spectral density.

The spectral density of the AR(2) process $X_t = \phi_1 X_{t-1} + \phi_2 X_{t-2} + Z_t$ with $Z_t \sim WN(0, \sigma^2)$, for example, is then given by

\[ f_X(\lambda) = \frac{\sigma^2}{2\pi\left(1 + \phi_1^2 + 2\phi_2 + \phi_2^2 + 2(\phi_1\phi_2 - \phi_1)\cos\lambda - 4\phi_2\cos^2\lambda\right)}. \]

The spectral density of an ARMA(1,1) process $X_t = \phi X_{t-1} + Z_t + \theta Z_{t-1}$ with $Z_t \sim WN(0, \sigma^2)$ is

\[ f_X(\lambda) = \frac{\sigma^2\left(1 + \theta^2 + 2\theta\cos\lambda\right)}{2\pi\left(1 + \phi^2 - 2\phi\cos\lambda\right)}. \]


An estimate of the spectral density is then obtained by replacing the unknown coefficients of the ARMA model by their corresponding estimates and by applying Formula (6.11) to the estimated model. Figure 6.5 compares the nonparametric to the parametric method based on an AR(4) model using the same data as in Fig. 6.4. Both methods produce similar estimates. They clearly show waves of periodicity of half a year and a year, corresponding to frequencies $\pi$ and $\frac{\pi}{2}$, respectively. The nonparametric estimate is, however, more volatile in the frequency band [0.6, 1] and around 2.5.
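
Formula (6.11) is straightforward to evaluate once the coefficients are given. The sketch below takes the AR and MA coefficients as plain lists (the sign convention $X_t = \phi_1 X_{t-1} + \dots + Z_t + \theta_1 Z_{t-1} + \dots$ is assumed) and locates the peak of the AR(2) example with $\phi_1 = 0.9$ and $\phi_2 = -0.7$ on a grid. In an application the lists would hold estimated coefficients; the function name and the grid size are illustrative.

```python
import cmath
import math


def arma_spectral_density(lam, ar, ma, sigma2=1.0):
    """sigma^2/(2*pi) * |Theta(e^{-i*lam})|^2 / |Phi(e^{-i*lam})|^2.

    ar = [phi_1, ..., phi_p] and ma = [theta_1, ..., theta_q], so that
    Phi(z) = 1 - phi_1*z - ... and Theta(z) = 1 + theta_1*z + ...
    """
    z = cmath.exp(-1j * lam)
    phi = 1.0 - sum(a * z ** (j + 1) for j, a in enumerate(ar))
    theta = 1.0 + sum(b * z ** (j + 1) for j, b in enumerate(ma))
    return sigma2 * abs(theta) ** 2 / (2.0 * math.pi * abs(phi) ** 2)


# AR(2) process X_t = 0.9 X_{t-1} - 0.7 X_{t-2} + Z_t: find the dominant frequency on a grid.
grid = [math.pi * i / 4000 for i in range(4001)]
peak = max(grid, key=lambda lam: arma_spectral_density(lam, [0.9, -0.7], []))
print(f"peak at lam = {peak:.4f}, period = {2.0 * math.pi / peak:.2f}")
```

The peak lies close to one radian, which corresponds to a cycle of a little more than six periods.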


6.4  Linear  Time-Invariant  Filters 

Time-invariant  linear  filters  are  an  indispensable  tool  in  time  series  analysis. 
Their  objective  is  to  eliminate  or  amplify  waves  of  a  particular  periodicity.  For 
example,  they  may  be  used  to  purge  a  series  from  seasonal  movements.  The 
seasonally  adjusted  time  series  should  then  reflect  more  strongly  the  business 
cyclical  movements  which  are  viewed  to  have  period  length  between  two  and  eight 
years.  The  spectral  analysis  provides  just  the  right  tools  to  construct  and  analyze 
such  filters. 

Fig. 6.5 Comparison of nonparametric and parametric estimates of the spectral density of the growth rate of investment in the construction sector

Definition 6.3. $\{Y_t\}$ is the output of the linear time-invariant filter (LTF) $\Psi = \{\psi_j, j = 0, \pm 1, \pm 2, \dots\}$ applied to the input $\{X_t\}$ if

\[ Y_t = \Psi(L)X_t = \sum_{j=-\infty}^{\infty}\psi_j X_{t-j} \quad \text{with} \quad \sum_{j=-\infty}^{\infty}|\psi_j| < \infty. \]

The filter is called causal or one-sided if $\psi_j = 0$ for $j < 0$; otherwise it is called two-sided.

Remark 6.5. Time-invariance in this context means that the lagged process $\{Y_{t-s}\}$ is obtained for all $s \in \mathbb{Z}$ from $\{X_{t-s}\}$ by applying the same filter $\Psi$.

Remark  6.6.  MA  processes,  causal  AR  processes  and  causal  ARMA  processes  can 
be  viewed  as  filtered  white  noise  processes. 

It  is  important  to  recognize  that  the  application  of  a  filter  systematically  changes 
the  autocorrelation  properties  of  the  original  time  series.  This  may  be  warranted 
in  some  cases,  but  may  lead  to  the  “discovery”  of  spurious  regularities  which  just 
reflect  the  properties  of  the  filter.  See  the  example  of  the  Kuznets  filter  below. 

Theorem 6.4 (Autocovariance Function of Filtered Process). Let $\{X_t\}$ be a mean-zero stationary process with autocovariance function $\gamma_X$. Then the filtered process $\{Y_t\}$ defined as

\[ Y_t = \sum_{j=-\infty}^{\infty}\psi_j X_{t-j} = \Psi(L)X_t \]

with $\sum_{j=-\infty}^{\infty}|\psi_j| < \infty$ is also a mean-zero stationary process with autocovariance function $\gamma_Y$. Thereby the two autocovariance functions are related as follows:

\[ \gamma_Y(h) = \sum_{j=-\infty}^{\infty}\sum_{k=-\infty}^{\infty}\psi_j\psi_k\gamma_X(h + k - j), \quad h = 0, \pm 1, \pm 2, \dots \]


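Proof. We first show the existence of the output process $\{Y_t\}$. To this end, consider the sequence of random variables $\{Y_t^{(m)}\}_{m=1,2,\dots}$ defined as

\[ Y_t^{(m)} = \sum_{j=-m}^{m}\psi_j X_{t-j}. \]

To show that the limit for $m \to \infty$ exists in the mean square sense, it is, according to Theorem C.6, enough to verify the Cauchy criterion

\[ E\left|Y_t^{(m)} - Y_t^{(n)}\right|^2 \to 0, \quad \text{for } m, n \to \infty. \]

Taking without loss of generality $m > n$, Minkowski's inequality (see Theorem C.2 or triangular inequality) leads to

\[ \left(E\left|\sum_{j=-m}^{m}\psi_j X_{t-j} - \sum_{j=-n}^{n}\psi_j X_{t-j}\right|^2\right)^{1/2} \le \left(E\left|\sum_{j=n+1}^{m}\psi_j X_{t-j}\right|^2\right)^{1/2} + \left(E\left|\sum_{j=-m}^{-n-1}\psi_j X_{t-j}\right|^2\right)^{1/2}. \]

Using the Cauchy-Bunyakovskii-Schwarz inequality and the stationarity of $\{X_t\}$, the first term on the right hand side is bounded by

\[
\begin{aligned}
\left(E\left|\sum_{j=n+1}^{m}\psi_j X_{t-j}\right|^2\right)^{1/2}
&\le \left(\sum_{j,k=n+1}^{m}|\psi_j||\psi_k|\,E\left(|X_{t-j}||X_{t-k}|\right)\right)^{1/2} \\
&\le \left(\sum_{j,k=n+1}^{m}|\psi_j||\psi_k|\left(EX_{t-j}^2\right)^{1/2}\left(EX_{t-k}^2\right)^{1/2}\right)^{1/2}
= \gamma_X(0)^{1/2}\sum_{j=n+1}^{m}|\psi_j|.
\end{aligned}
\]

As $\sum_{j=-\infty}^{\infty}|\psi_j| < \infty$ by assumption, the last term converges to zero. The second term can be treated in the same way. Thus, the limit of $\{Y_t^{(m)}\}$, $m \to \infty$, denoted by $S_t$, exists in the mean square sense with $ES_t^2 < \infty$.

It remains to show that $S_t$ and $\sum_{j=-\infty}^{\infty}\psi_j X_{t-j}$ are actually equal with probability one. This is established by noting that

\[ E\left|S_t - \Psi(L)X_t\right|^2 = E\liminf_{m\to\infty}\left|S_t - \sum_{j=-m}^{m}\psi_j X_{t-j}\right|^2 \le \liminf_{m\to\infty}E\left|S_t - \sum_{j=-m}^{m}\psi_j X_{t-j}\right|^2 = 0 \]

where use has been made of Fatou's lemma.

The stationarity of $\{Y_t\}$ can be checked as follows:

\[ EY_t = \lim_{m\to\infty}\sum_{j=-m}^{m}\psi_j EX_{t-j} = 0, \]

\[ EY_tY_{t-h} = \lim_{m\to\infty}E\left[\left(\sum_{j=-m}^{m}\psi_j X_{t-j}\right)\left(\sum_{k=-m}^{m}\psi_k X_{t-h-k}\right)\right] = \sum_{j=-\infty}^{\infty}\sum_{k=-\infty}^{\infty}\psi_j\psi_k\gamma_X(h + k - j). \]

Thus, $EY_t$ and $EY_tY_{t-h}$ are finite and independent of $t$. $\{Y_t\}$ is therefore stationary. □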

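Corollary 6.2. If $X_t \sim WN(0, \sigma^2)$ and $Y_t = \sum_{j=0}^{\infty}\psi_j X_{t-j}$ with $\sum_{j=0}^{\infty}|\psi_j| < \infty$ then the above expression for $\gamma_Y(h)$ simplifies to

\[ \gamma_Y(h) = \sigma^2\sum_{j=0}^{\infty}\psi_j\psi_{j+|h|}. \]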
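
The simplification for white noise input can be illustrated with a finite causal filter. The sketch below evaluates the double sum of Theorem 6.4 and the shortcut of Corollary 6.2 for hypothetical weights $\psi = (1, 0.5, 0.25)$ and $\sigma^2 = 2$; both the weights and the function names are assumptions chosen for the example.

```python
def filtered_autocovariance(psi, gamma_x, h):
    """gamma_Y(h) = sum_j sum_k psi_j * psi_k * gamma_X(h + k - j) for weights psi_0, ..., psi_q."""
    return sum(
        psi_j * psi_k * gamma_x(h + k - j)
        for j, psi_j in enumerate(psi)
        for k, psi_k in enumerate(psi)
    )


def white_noise_case(psi, sigma2, h):
    """Simplified formula sigma^2 * sum_j psi_j * psi_{j+|h|} for white noise input."""
    h = abs(h)
    return sigma2 * sum(psi[j] * psi[j + h] for j in range(len(psi) - h))


sigma2 = 2.0
psi = [1.0, 0.5, 0.25]  # hypothetical filter weights psi_0, psi_1, psi_2


def gamma_white_noise(h):
    """Autocovariance function of white noise: sigma^2 at lag zero, zero elsewhere."""
    return sigma2 if h == 0 else 0.0


for h in range(-3, 4):
    print(h, filtered_autocovariance(psi, gamma_white_noise, h), white_noise_case(psi, sigma2, h))
```

Both columns agree, and the autocovariances vanish for lags beyond the length of the filter, as expected for a moving-average process.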


Remark 6.7. In the proof of the existence of $\{Y_t\}$, the assumption of the stationarity of $\{X_t\}$ can be weakened by assuming only $\sup_t EX_t^2 < \infty$.

Theorem 6.5. Under the conditions of Theorem 6.4, the spectral densities of $\{X_t\}$ and $\{Y_t\}$ are related as

\[ f_Y(\lambda) = \left|\Psi(e^{-i\lambda})\right|^2 f_X(\lambda) = \Psi(e^{i\lambda})\Psi(e^{-i\lambda})f_X(\lambda) \]

where $\Psi(e^{-i\lambda}) = \sum_{j=-\infty}^{\infty}\psi_j e^{-ij\lambda}$ is called the transfer function of the filter.

To understand the effect of the filter $\Psi$, consider the simple harmonic process $X_t = 2\cos(\lambda t) = e^{i\lambda t} + e^{-i\lambda t}$. Passing $\{X_t\}$ through the filter $\Psi$ gives $Y_t = e^{i\lambda t}\Psi(e^{-i\lambda}) + e^{-i\lambda t}\Psi(e^{i\lambda})$, so that the transformed time series $\{Y_t\}$ can be written as

\[ Y_t = 2\left|\Psi(e^{-i\lambda})\right|\cos\left(\lambda\left(t - \frac{\theta(\lambda)}{\lambda}\right)\right) \]

where $\theta(\lambda) = -\arg\Psi(e^{-i\lambda})$. The filter therefore amplifies some frequencies by the factor $g(\lambda) = |\Psi(e^{-i\lambda})|$ and delays $X_t$ by $\frac{\theta(\lambda)}{\lambda}$ periods. Thus, we have a change in amplitude given by the amplitude gain function $g(\lambda)$ and a phase shift given by the phase gain function $\theta(\lambda)$. If the gain function is bigger than one the corresponding frequency is amplified. On the other hand, if the value is smaller than one the corresponding frequency is dampened.
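
Gain and phase shift can be computed directly from the filter weights. The sketch below uses a hypothetical two-period moving average, represents the filter as a dictionary mapping the lag $j$ to the weight $\psi_j$, and measures the delay as minus the argument of the transfer function divided by the frequency. This sign convention is an assumption of the example, chosen so that a positive value means a delay.

```python
import cmath
import math


def transfer_function(psi, lam):
    """Psi(e^{-i*lam}) = sum_j psi_j * exp(-i*j*lam); psi maps the lag j to the weight psi_j."""
    return sum(weight * cmath.exp(-1j * j * lam) for j, weight in psi.items())


def apply_filter(psi, x, t):
    """Y_t = sum_j psi_j * X_{t-j} for an input given as a function x(t)."""
    return sum(weight * x(t - j) for j, weight in psi.items())


psi = {0: 0.5, 1: 0.5}          # hypothetical filter: average of current and previous value
lam = 0.8
value = transfer_function(psi, lam)
gain = abs(value)               # change in amplitude
phase = cmath.phase(value)      # argument of the transfer function
delay = -phase / lam            # delay in periods under the sign convention stated above


def x(t):
    """Simple harmonic input X_t = 2*cos(lam*t)."""
    return 2.0 * math.cos(lam * t)


for t in range(3):
    direct = apply_filter(psi, x, t)
    formula = 2.0 * gain * math.cos(lam * (t - delay))
    print(t, round(direct, 6), round(formula, 6))
```

For this filter the gain equals $\cos(\lambda/2) < 1$, so every frequency is dampened, and the output lags the input by half a period.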


Examples  of  Filters 

• First differences (changes with respect to previous period):

\[ \Psi(L) = \Delta = 1 - L. \]

The transfer function of this filter is $1 - e^{-i\lambda}$ and the squared gain function is $2(1 - \cos\lambda)$. These functions take the value zero for $\lambda = 0$. Thus, the filter eliminates the trend which can be considered as a wave with an infinite period length.

• Change with respect to same quarter last year, assuming that the data are quarterly observations:

\[ \Psi(L) = 1 - L^4. \]

The transfer function and the squared gain function are $1 - e^{-4i\lambda}$ and $2(1 - \cos(4\lambda))$, respectively. Thus, the filter eliminates all frequencies which are multiples of $\frac{\pi}{2}$ including the zero frequency. In particular, it eliminates the trend and waves with periodicity of four quarters.

• A famous example of a filter which led to wrong conclusions is the Kuznets filter (see Sargent 1987, 273-276). Assuming yearly data, this filter is obtained from two transformations carried out in a row. The first transformation, which should eliminate cyclical movements, takes centered five year moving averages. The second one takes centered non-overlapping first differences. Thus, the filter can be written as:

\[ \Psi(L) = \underbrace{\frac{1}{5}\left(L^{-2} + L^{-1} + 1 + L + L^2\right)}_{\text{first transformation}}\;\underbrace{\left(L^{-5} - L^5\right)}_{\text{second transformation}}. \]

Figure 6.6 gives a plot of the transfer function of the Kuznets filter. Thereby it can be seen that all frequencies are dampened, except those around $\lambda = 0.2886$. The value $\lambda = 0.2886$ corresponds to a wave with periodicity of approximately $2\pi/0.2886 = 21.77 \approx 22$ years. Thus, as first claimed by Howrey (1968), even a filtered white noise time series would exhibit a 22 year cycle. This demonstrates that cycles of this length, related by Kuznets (1930) to demographic processes and infrastructure investment swings, may just be an artefact produced by the filter and are therefore not endorsed by the data.

Fig. 6.6 Transfer function of the Kuznets filter
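
The behavior of the Kuznets filter can be explored numerically. The sketch below evaluates the modulus of its transfer function on a fine grid and reports the frequency at which the gain peaks. The grid size and the names are assumptions of this example, and the peak location found this way depends slightly on the grid and on which function is plotted, so it need not coincide to the last digit with the value read off the figure.

```python
import cmath
import math


def kuznets_transfer(lam):
    """Transfer function of (1/5)(L^-2 + ... + L^2)(L^-5 - L^5), with L replaced by exp(-i*lam)."""
    z = cmath.exp(-1j * lam)
    moving_average = sum(z ** j for j in range(-2, 3)) / 5.0
    long_difference = z ** (-5) - z ** 5
    return moving_average * long_difference


n = 20000
grid = [math.pi * i / n for i in range(1, n + 1)]
gains = [abs(kuznets_transfer(lam)) for lam in grid]
peak_index = max(range(n), key=lambda i: gains[i])
peak = grid[peak_index]
print(f"peak frequency {peak:.4f}, gain {gains[peak_index]:.3f}, period {2.0 * math.pi / peak:.1f} years")
```

The gain exceeds one only in a narrow band around a frequency of roughly 0.29, i.e. for cycles of about 22 years, and it vanishes at frequency zero. Applying the same functions to any input spectrum, including a flat one, therefore produces a pronounced peak at this frequency.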


6.5  Some  Important  Filters 

6.5.1  Construction  of  Low-  and  High-Pass  Filters 


For some purposes it is desirable to eliminate specific frequencies. Suppose we
want to purge a time series of all movements with frequencies above $\lambda_c$, but leave
those below this value unchanged. The transfer function of such an ideal low-pass
filter would be:

$$\Psi(e^{-i\lambda}) = \begin{cases} 1, & \text{for } \lambda \le \lambda_c; \\ 0, & \text{for } \lambda > \lambda_c. \end{cases}$$

By expanding $\Psi(e^{-i\lambda})$ into a Fourier series $\Psi(e^{-i\lambda}) = \sum_{j=-\infty}^{\infty} \psi_j e^{-ij\lambda}$, it is
possible to determine the appropriate filter coefficients $\{\psi_j\}$. In the case of a
low-pass filter they are given by:

$$\psi_j = \frac{1}{2\pi}\int_{-\lambda_c}^{\lambda_c} e^{-ij\omega}\,d\omega = \begin{cases} \dfrac{\lambda_c}{\pi}, & j = 0; \\[1ex] \dfrac{\sin(j\lambda_c)}{j\pi}, & j \ne 0. \end{cases}$$

The implementation of the filter in practice is not straightforward because only a
finite number of coefficients can be used. Depending on the number of observations,
the filter must be truncated such that only those $\psi_j$ with $|j| \le q$ are actually
employed. The problem becomes more severe as one gets to the more recent
observations because fewer future observations are available. For the most recent
period no future observation is available at all. This problem is usually overcome
by replacing the missing future values by their corresponding forecasts. Despite this
remedy, the filter works best in the middle of the sample and is more and more
distorted as one approaches the beginning or the end of the sample.
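The effect of the truncation can be illustrated with a short computation. The sketch below, written under the assumption of a cutoff $\lambda_c = \pi/16$, builds the truncated coefficients $\psi_j$, $|j| \le q$, and evaluates the resulting transfer function at a few frequencies. The helper names `lowpass_coefficients` and `transfer` are hypothetical and only serve this illustration.

```python
import math

def lowpass_coefficients(cutoff, q):
    """Ideal low-pass weights psi_j for |j| <= q (dictionary keyed by j)."""
    weights = {0: cutoff / math.pi}
    for j in range(1, q + 1):
        weights[j] = weights[-j] = math.sin(j * cutoff) / (j * math.pi)
    return weights

def transfer(weights, lam):
    """Transfer function sum_j psi_j e^{-i j lam}; real because psi_j = psi_{-j}."""
    return sum(w * math.cos(j * lam) for j, w in weights.items())

cutoff = math.pi / 16                             # assumed critical frequency
for q in (8, 32):
    psi = lowpass_coefficients(cutoff, q)
    values = [transfer(psi, lam) for lam in (0.0, cutoff, math.pi / 2, math.pi)]
    print(q, [round(v, 3) for v in values])

psi_32 = lowpass_coefficients(cutoff, 32)
```

With a finite $q$ the transfer function only approximates the ideal step: it is close to, but not exactly, one at frequency zero, close to zero far above the cutoff, and roughly one half at the cutoff itself.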

Analogously to low-pass filters, it is possible to construct high-pass filters.
Figure 6.7 compares the transfer function of an ideal high-pass filter with two filters
truncated at $q = 8$ and $q = 32$, respectively. Obviously, the transfer function with
the higher $q$ approximates the ideal filter better. In the neighborhood of the critical
frequency, in our case $\pi/16$, however, the approximation remains inaccurate. This
is known as the Gibbs phenomenon.


6.5.2  The  Hodrick-Prescott  Filter 

The Hodrick-Prescott filter (HP-filter) has gained great popularity in the macroeconomic
literature, particularly in the context of real business cycle theory. This
high-pass filter is designed to eliminate the trend and cycles of high periodicity and
to emphasize movements at business cycle frequencies (see Hodrick and Prescott
1980; King and Rebelo 1993; Brandner and Neusser 1992).

One way to introduce the HP-filter is to examine the problem of decomposing
a time series $\{X_t\}$ additively into a growth component $\{G_t\}$ and a cyclical
component $\{C_t\}$:

$$X_t = G_t + C_t.$$


This decomposition is, without further information, not unique. Following the
suggestion of Whittaker (1923), the growth component should be approximated by a
smooth curve. Based on this recommendation Hodrick and Prescott suggest solving
the following restricted least-squares problem given a sample $\{X_t\}_{t=1,\dots,T}$:

$$\sum_{t=1}^{T} (X_t - G_t)^2 + \lambda \sum_{t=2}^{T-1} \left[(G_{t+1} - G_t) - (G_t - G_{t-1})\right]^2 \longrightarrow \min_{\{G_t\}}.$$

The above objective function has two terms. The first one measures the fit of $\{G_t\}$ to
the data. The closer $\{G_t\}$ is to $\{X_t\}$ the smaller this term becomes. In the limit when
$G_t = X_t$ for all $t$, the term is minimized and equal to zero. The second term measures
the smoothness of the growth component by looking at the discrete analogue to the
second derivative. This term is minimized if the changes of the growth component
from one period to the next are constant. This, however, implies that $G_t$ is a linear
function. Thus the above objective function represents a trade-off between fitting
the data and smoothness of the approximating function. This trade-off is governed
by the meta-parameter $\lambda$ which must be fixed a priori.

Fig. 6.7 Transfer function of HP-filter in comparison to high-pass filters

The value of $\lambda$ depends on the critical frequency and on the periodicity of the
data (see Uhlig and Ravn 2002, for the latter). Following the proposal by Hodrick
and Prescott (1980) the following values for $\lambda$ are common in the literature:

$$\lambda = \begin{cases} 6.25, & \text{yearly observations;} \\ 1600, & \text{quarterly observations;} \\ 14400, & \text{monthly observations.} \end{cases}$$

It can be shown that these choices for $\lambda$ practically eliminate waves of periodicity
longer than eight years. The cyclical or business cycle component is therefore
composed of waves with periodicity less than eight years. Thus, the choice of $\lambda$
implicitly defines the business cycle. Figure 6.7 compares the transfer function of
the HP-filter to the ideal high-pass filter and two approximate high-pass filters.7
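The first-order conditions of the HP minimization problem form a linear system, $(I + \lambda K'K)G = X$, where $K$ denotes the $(T-2) \times T$ matrix that takes second differences. The following minimal sketch solves this system with plain Gaussian elimination. It is meant as an illustration under these assumptions rather than an efficient implementation, and the function name `hp_filter` is hypothetical.

```python
def hp_filter(x, lam):
    """Return (growth, cycle) solving the HP problem for the series x."""
    T = len(x)
    # Build A = I + lam * K'K, where K takes second differences (rows t = 2..T-1).
    A = [[1.0 if i == j else 0.0 for j in range(T)] for i in range(T)]
    for t in range(1, T - 1):
        row = {t - 1: 1.0, t: -2.0, t + 1: 1.0}
        for i, ki in row.items():
            for j, kj in row.items():
                A[i][j] += lam * ki * kj
    # Solve A g = x by Gaussian elimination with partial pivoting.
    M = [A[i] + [float(x[i])] for i in range(T)]
    for col in range(T):
        pivot = max(range(col, T), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, T):
            factor = M[r][col] / M[col][col]
            for c in range(col, T + 1):
                M[r][c] -= factor * M[col][c]
    g = [0.0] * T
    for i in reversed(range(T)):
        g[i] = (M[i][T] - sum(M[i][c] * g[c] for c in range(i + 1, T))) / M[i][i]
    cycle = [xi - gi for xi, gi in zip(x, g)]
    return g, cycle

# A linear series has zero second differences, so the growth component
# should reproduce it and the cyclical component should vanish.
linear = [2.0 + 0.5 * t for t in range(20)]
growth, cycle = hp_filter(linear, lam=1600.0)
print(max(abs(c) for c in cycle))
```

The linear test series reflects the remark above that the penalty term is minimized by a linear function: in that case the fit term can be driven to zero as well, so the cyclical component is zero up to rounding error.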

As  an  example,  Fig.  6.8  displays  the  HP-filtered  US  logged  GDP  together  with 
the  original  series  in  the  upper  panel  and  the  implied  business  cycle  component  in 
the  lower  panel. 


7 Like all filters, the HP-filter systematically distorts the properties of the time series. Harvey and
Jaeger (1993) show how the blind application of the HP-filter can lead to the detection of spurious
cyclical behavior.




Fig. 6.8 HP-filtered US logged GDP (upper panel) and cyclical component (lower panel)


6.5.3  Seasonal  Filters 

Besides the elimination of trends, the removal of seasonal movements represents
another important application in practice. Seasonal movements typically arise when
a time series is observed several times within a year, giving rise to the possibility of
waves with periodicity less than twelve months in the case of monthly observations,
or less than four quarters in the case of quarterly observations. These cycles
are usually considered to be of minor economic interest because they are due
to systematic seasonal variations in weather conditions or holidays (Easter, for
example).8 Such variations can, for example, reduce construction activity during
the winter season or production in general during holiday time. These cycles
usually have quite a large amplitude so that they obstruct the view of the economically and
politically more important business cycles. In practice, one may therefore prefer to
work with seasonally adjusted series. This means that one must remove the seasonal
components from the time series in a preliminary stage of the analysis. In this section, we
will confine ourselves to only a few remarks. A comprehensive treatment of seasonality
can be found in Hylleberg (1986) and Ghysels and Osborn (2001).

8 An exception to this view is provided by Miron (1996).

Fig. 6.9 Transfer function of growth rate of investment in the construction sector with and without
seasonal adjustment

Two simple filters for the elimination of seasonal cycles in the case of quarterly
data are given by the one-sided filter

$$\Psi(L) = (1 + L + L^2 + L^3)/4$$

or the two-sided filter

$$\Psi(L) = \frac{1}{8}L^{2} + \frac{1}{4}L + \frac{1}{4} + \frac{1}{4}L^{-1} + \frac{1}{8}L^{-2}.$$
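That both filters remove the seasonal waves of quarterly data can be verified by evaluating their gain at the seasonal frequencies $\pi/2$ and $\pi$. The following sketch represents each filter as a mapping from the lag $j$ to the weight $\psi_j$; the helper name `gain` is an assumption made for this illustration.

```python
import cmath
import math

def gain(weights, lam):
    """Gain |sum_j psi_j e^{-i j lam}| of a filter given as {lag j: weight psi_j}."""
    return abs(sum(w * cmath.exp(-1j * j * lam) for j, w in weights.items()))

one_sided = {0: 0.25, 1: 0.25, 2: 0.25, 3: 0.25}
two_sided = {-2: 0.125, -1: 0.25, 0: 0.25, 1: 0.25, 2: 0.125}

for name, weights in (("one-sided", one_sided), ("two-sided", two_sided)):
    values = [gain(weights, lam) for lam in (0.0, math.pi / 2, math.pi)]
    print(name, [round(v, 6) for v in values])
```

For both filters the gain equals one at frequency zero, so that the level of the series is preserved, and zero at $\pi/2$ and $\pi$, so that the yearly and half-yearly waves are eliminated.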

In practice, the so-called X-11 filter or its enhanced versions, the X-12 and X-13
filters, developed by the United States Census Bureau are often applied. This filter is
a two-sided filter which, in contrast to the two examples above, makes use of all sample
observations. As this filter not only adjusts for seasonality, but also corrects for
outliers, a blind mechanical use is not recommended. Gomez and Maravall (1996)
developed an alternative method known under the name TRAMO-SEATS. More
details on the implementation of both methods can be found in Eurostat (2009).

Figure 6.9 shows the effect of seasonal adjustment using TRAMO-SEATS by
looking at the corresponding transfer functions of the growth rate of construction
investment. One can clearly discern how the yearly and the half-yearly waves
corresponding to the frequencies $\pi/2$ and $\pi$ are dampened. On the other hand, the
seasonal filter weakly amplifies a cycle of frequency 0.72 corresponding to a cycle
with a periodicity of about two years.


6.5.4  Using  Filtered  Data 

Whether or not to use filtered, especially seasonally adjusted, data is still the subject of an ongoing
debate. Although the use of unadjusted data together with a correctly specified
model is clearly the best choice, there is a nonnegligible uncertainty in modeling
economic time series. Thus, in practice one faces several trade-offs which must be
taken into account and which may depend on the particular context (Sims 1974,
1993; Hansen and Sargent 1993). On the one hand, the use of adjusted data may
disregard important information on the dynamics of the time series and introduce
some biases. On the other hand, the use of unadjusted data runs the risk
of misspecification, especially because the usual measures of fit may put too much
emphasis on fitting the seasonal frequencies, thereby neglecting other frequencies.


6.6  Exercises 

Exercise  6.6.1. 

(i) Show that the process defined in Eq. (6.4) has an autocovariance function equal
to $\gamma(h) = \cos(\omega h)$.

(ii) Show that the process defined in Eq. (6.7) has autocovariance function

$$\gamma(h) = \sum_{j=1}^{k} \sigma_j^2 \cos(\omega_j h).$$

Exercise 6.6.2. Compute the transfer and the gain function for the following filters:

(i) $\Psi(L) = 1 - L$

(ii) $\Psi(L) = 1 - L^4$


7.1 Definition, Properties and Interpretation

Up to now the discussion concentrated on stationary processes and in particular
ARMA processes. According to the Wold decomposition theorem (see
Theorem 3.1) every purely non-deterministic process possesses the following
representation:

$$X_t = \mu + \Psi(L) Z_t,$$

where $\{Z_t\} \sim \mathrm{WN}(0, \sigma^2)$ and $\sum_{j=0}^{\infty} \psi_j^2 < \infty$. Typically, we model $X_t$ as an ARMA
process so that $\Psi(L) = \Theta(L)/\Phi(L)$. This representation implies:

• $\mathbb{E} X_t = \mu$,

• $\lim_{h \to \infty} P_t X_{t+h} = \mu$.

The above property is often referred to as mean reverting because the process moves
around a constant mean. Deviations from this mean are only temporary or transitory.
Thus, the best long-run forecast is just the mean of the process.

This property is often violated by economic time series which typically show
a tendency to grow. Classic examples are time series for GDP (see Fig. 1.3) or
some stock market index (see Fig. 1.5). This trending property is not compatible
with stationarity as the mean is no longer constant. In order to cope with this
characteristic of economic time series, two very different alternatives have been
proposed. The first one consists in letting the mean $\mu$ be a function of time, $\mu(t)$.
The most popular specification for $\mu(t)$ is a linear function, i.e. $\mu(t) = \alpha + \delta t$. In
this case we get:

$$X_t = \underbrace{\alpha + \delta t}_{\text{linear trend}} + \Psi(L) Z_t.$$

The process $\{X_t\}$ is then referred to as a trend-stationary process. In practice one also
encounters quadratic polynomials of $t$ or piecewise linear functions. For example,
$\mu(t) = \alpha_1 + \delta_1 t$ for $t \le t_0$ and $\mu(t) = \alpha_2 + \delta_2 t$ for $t > t_0$. In the following, we
restrict ourselves to linear trend functions.

The second alternative assumes that the time series becomes stationary after
differencing. The number of times one has to difference the process to achieve
stationarity is called the order of integration. If differencing $d$ times is necessary,
the process is called integrated of order $d$ and is denoted by $X_t \sim I(d)$. If the
resulting time series, $\Delta^d X_t = (1 - L)^d X_t$, is an ARMA(p,q) process, the original
process is called an ARIMA(p,d,q) process. Usually it is sufficient to difference
the time series only once, i.e. $d = 1$. For expositional purposes we will stick to this
case.

The formal definition of an I(1) process is given as follows.

Definition 7.1. The stochastic process $\{X_t\}$ is called integrated of order one or
difference-stationary, denoted as $X_t \sim I(1)$, if and only if $\Delta X_t = X_t - X_{t-1}$ can be
represented as

$$\Delta X_t = (1 - L)X_t = \delta + \Psi(L)Z_t, \qquad \Psi(1) \ne 0,$$

with $\{Z_t\} \sim \mathrm{WN}(0, \sigma^2)$ and $\sum_{j=0}^{\infty} j|\psi_j| < \infty$.

The qualification $\Psi(1) \ne 0$ is necessary to avoid trivial and uninteresting
cases. Suppose for the moment that $\Psi(1) = 0$; then it would be possible to write
$\Psi(L)$ as $(1 - L)\widetilde{\Psi}(L)$ for some lag polynomial $\widetilde{\Psi}(L)$. This would, however, imply
that the factor $1 - L$ could be canceled in the above definition so that $\{X_t\}$ is
already stationary and that the differencing would be unnecessary. The assumption
$\Psi(1) \ne 0$ thus excludes the case where a trend-stationary process could be regarded
as an integrated process. For each trend-stationary process $X_t = \alpha + \delta t + \widetilde{\Psi}(L)Z_t$ we
have $\Delta X_t = \delta + \Psi(L)Z_t$ with $\Psi(L) = (1 - L)\widetilde{\Psi}(L)$. This would violate the condition
$\Psi(1) \ne 0$. Thus a trend-stationary process cannot be a difference-stationary process.

The condition $\sum_{j=0}^{\infty} j|\psi_j| < \infty$ implies $\sum_{j=0}^{\infty} \psi_j^2 < \infty$ and is therefore
stronger than necessary for the Wold representation to hold. In particular, it implies
the Beveridge-Nelson decomposition of integrated processes into a linear trend,
a random walk, and a stationary component (see Sect. 7.1.4). The condition is
automatically fulfilled for all ARMA processes because $\{\psi_j\}$ decays exponentially
to zero.

Integrated processes with $d > 0$ are also called unit-root processes. This
designation results from the fact that ARIMA processes with $d > 0$ can be viewed
as ARMA processes, whereby the AR polynomial has a $d$-fold root of one.1 An
important prototype of an integrated process is the random walk with drift $\delta$:

$$X_t = \delta + X_{t-1} + Z_t, \qquad Z_t \sim \mathrm{WN}(0, \sigma^2).$$

1 Strictly speaking this does not conform to the definitions used in this book because our definition
of ARMA processes assumes stationarity.

Trend-stationary and difference-stationary processes have quite different
characteristics.  In  particular,  they  imply  different  behavior  with  respect  to  the 
long-run  forecast,  the  variance  of  the  forecast  error,  and  the  impulse  response 
function.  In  the  next  section,  we  will  explore  these  properties  in  detail. 


7.1.1  Long-Run  Forecast 

The optimal forecast in the least-squares sense given the infinite past of a
trend-stationary process is given by

$$P_t X_{t+h} = \alpha + \delta(t + h) + \psi_h Z_t + \psi_{h+1} Z_{t-1} + \dots$$

Thus we have

$$\lim_{h \to \infty} \mathbb{E}\left(P_t X_{t+h} - \alpha - \delta(t + h)\right)^2 = \sigma^2 \lim_{h \to \infty} \sum_{j=0}^{\infty} \psi_{h+j}^2 = 0$$

because $\sum_{j=0}^{\infty} \psi_j^2 < \infty$. Thus the long-run forecast is given by the linear trend.
Even if $X_t$ deviates temporarily from the trend line, it is assumed to return to it.
A trend-stationary process therefore behaves in the long run like $\mu(t) = \alpha + \delta t$.

For an integrated process, the forecast of the differenced series is

$$P_t \Delta X_{t+h} = \delta + \psi_h Z_t + \psi_{h+1} Z_{t-1} + \psi_{h+2} Z_{t-2} + \dots$$

The level of $X_{t+h}$ is by definition

$$X_{t+h} = (X_{t+h} - X_{t+h-1}) + (X_{t+h-1} - X_{t+h-2}) + \dots + (X_{t+1} - X_t) + X_t$$

so that

$$\begin{aligned}
P_t X_{t+h} &= P_t \Delta X_{t+h} + P_t \Delta X_{t+h-1} + \dots + P_t \Delta X_{t+1} + X_t \\
&= \delta + \psi_h Z_t + \psi_{h+1} Z_{t-1} + \psi_{h+2} Z_{t-2} + \dots \\
&\quad + \delta + \psi_{h-1} Z_t + \psi_h Z_{t-1} + \psi_{h+1} Z_{t-2} + \dots \\
&\quad + \delta + \psi_{h-2} Z_t + \psi_{h-1} Z_{t-1} + \psi_h Z_{t-2} + \dots \\
&\quad + \dots + X_t \\
&= X_t + \delta h \\
&\quad + (\psi_h + \psi_{h-1} + \dots + \psi_1) Z_t \\
&\quad + (\psi_{h+1} + \psi_h + \dots + \psi_2) Z_{t-1} \\
&\quad + \dots
\end{aligned}$$

This shows that also for the integrated process the long-run forecast depends on
a linear trend with slope $\delta$. However, the intercept is no longer a fixed number,
but given by $X_t$ which is stochastic. With each new realization of $X_t$ the intercept
changes so that the trend line moves in parallel up and down. This issue can be well
illustrated by the following two examples.

Example 1. Let $\{X_t\}$ be a random walk with drift $\delta$. Then the best forecast of $X_{t+h}$,
$P_t X_{t+h}$, is

$$P_t X_{t+h} = \delta h + X_t.$$

The forecast thus increases at rate $\delta$ starting from the initial value $X_t$. $\delta$ is therefore
the slope of a linear trend. The intercept of this trend is stochastic and equal to $X_t$.
Thus the trend line moves in parallel up or down depending on the realization of $X_t$.

Example 2. Let $\{X_t\}$ be an ARIMA(0,1,1) process given by $\Delta X_t = \delta + Z_t + \theta Z_{t-1}$
with $|\theta| < 1$. The best forecast of $X_{t+h}$ is then given by

$$P_t X_{t+h} = \delta h + X_t + \theta Z_t.$$

As before the intercept changes in a stochastic way, but in contrast to the previous
example it is now given by $X_t + \theta Z_t$. If we consider the forecast given the infinite
past, the invertibility of the process implies that $Z_t$ can be expressed as a weighted
sum of current and past realizations of $\Delta X_t$ (see Sects. 2.3 and 3.1).


7.1.2 Variance of Forecast Error

In the case of a trend-stationary process the forecast error is

$$X_{t+h} - P_t X_{t+h} = Z_{t+h} + \psi_1 Z_{t+h-1} + \dots + \psi_{h-1} Z_{t+1}.$$

As the mean of the forecast error is zero, the variance is

$$\mathbb{E}\left(X_{t+h} - P_t X_{t+h}\right)^2 = \left(1 + \psi_1^2 + \psi_2^2 + \dots + \psi_{h-1}^2\right)\sigma^2.$$

For $h$ going to infinity this expression converges to $\sigma^2 \sum_{j=0}^{\infty} \psi_j^2 < \infty$. This is
nothing but the unconditional variance of $X_t$. Thus the variance of the forecast error
increases with the length of the forecasting horizon, but remains bounded.

For the integrated process the forecast error can be written as

$$X_{t+h} - P_t X_{t+h} = Z_{t+h} + (1 + \psi_1) Z_{t+h-1} + \dots + (1 + \psi_1 + \psi_2 + \dots + \psi_{h-1}) Z_{t+1}.$$

The forecast error variance therefore is

$$\mathbb{E}\left(X_{t+h} - P_t X_{t+h}\right)^2 = \left[1 + (1 + \psi_1)^2 + \dots + (1 + \psi_1 + \dots + \psi_{h-1})^2\right]\sigma^2.$$

This expression increases with the length of the forecast horizon $h$, but is no longer
bounded. It increases linearly in $h$ to infinity.2 The precision of the forecast therefore
not only decreases with the forecasting horizon $h$ as in the case of the
trend-stationary model, but converges to zero. In the example above of the ARIMA(0,1,1)
process the forecasting error variance is

$$\mathbb{E}\left(X_{t+h} - P_t X_{t+h}\right)^2 = \left[1 + (h - 1)(1 + \theta)^2\right]\sigma^2.$$

This expression clearly increases linearly with $h$.
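The two variance formulas can be evaluated directly from the weights $\psi_j$. The sketch below is only an illustration: the function name `forecast_error_variance` and the value $\theta = 0.5$ are assumptions. It compares the general expression for the integrated case with the closed form for the ARIMA(0,1,1) process.

```python
def forecast_error_variance(psi, h, sigma2=1.0, integrated=True):
    """h-step forecast error variance from the MA weights psi = [psi_0, psi_1, ...].

    psi holds the weights of X_t (trend-stationary case) or of Delta X_t
    (integrated case); psi[0] must equal 1 and missing weights count as zero.
    """
    weights = list(psi) + [0.0] * max(0, h - len(psi))
    total, cumulative = 0.0, 0.0
    for k in range(h):
        cumulative += weights[k]                  # 1 + psi_1 + ... + psi_k
        term = cumulative if integrated else weights[k]
        total += term ** 2
    return sigma2 * total

theta = 0.5                                       # hypothetical MA coefficient
for h in (1, 2, 5, 10):
    numeric = forecast_error_variance([1.0, theta], h)
    closed_form = 1.0 + (h - 1) * (1.0 + theta) ** 2
    print(h, numeric, closed_form)
```

Setting `integrated=False` gives the bounded variance of the trend-stationary case, whereas the default accumulates the weights and therefore grows without bound in $h$.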


7.1.3 Impulse Response Function

The impulse response function (dynamic multiplier) is an important analytical tool
as it gives the response of the variable $X_t$ to the underlying shocks. In the case of the
trend-stationary process the impulse response function is

$$\frac{\partial P_t X_{t+h}}{\partial Z_t} = \psi_h \longrightarrow 0 \quad \text{for } h \to \infty.$$

The effect of a shock thus declines with time and dies out. Shocks have therefore
only transitory or temporary effects. In the case of an ARMA process the effect even
declines exponentially (see the considerations in Sect. 2.3).3

In the case of integrated processes the impulse response function for $\Delta X_t$ implies:

$$\frac{\partial P_t X_{t+h}}{\partial Z_t} = 1 + \psi_1 + \psi_2 + \dots + \psi_h.$$

For $h$ going to infinity, this expression converges to $\sum_{j=0}^{\infty} \psi_j = \Psi(1) \ne 0$. This
implies that a shock experienced in period $t$ will have a long-run or permanent
effect. This long-run effect is called persistence. If $\{\Delta X_t\}$ is an ARMA process then
the persistence is given by the expression

$$\Psi(1) = \frac{\Theta(1)}{\Phi(1)}.$$

Thus, for an ARIMA(0,1,1) the persistence is $\Psi(1) = \Theta(1) = 1 + \theta$. In the next
section we will discuss some examples.

2 Proof: By assumption $\{\psi_j\}$ is absolutely summable so that $\Psi(1)$ converges. Moreover, as
$\Psi(1) \ne 0$, there exists $\varepsilon > 0$ and an integer $m$ such that $\left|\sum_{j=0}^{h} \psi_j\right| > \varepsilon$ for all $h > m$. The
squares are therefore bounded from below by $\varepsilon^2 > 0$ so that their infinite sum diverges to infinity.

3 The use of the partial derivative is just for convenience. It does not mean that $X_{t+h}$ is differentiated
in the literal sense.


7.1.4  The  Beveridge-Nelson  Decomposition 

The Beveridge-Nelson decomposition represents an important tool for the understanding
of integrated processes.4 It shows how an integrated time series of order
one can be represented as the sum of a linear trend, a random walk, and a stationary
series. It may therefore be used to extract the cyclical component (business cycle
component) of a time series and can thus be viewed as an alternative to the HP-filter
(see Sect. 6.5.2) or to more elaborate so-called structural time series models (see
Sects. 17.1 and 17.4.2).

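Assuming that $\{X_t\}$ is an integrated process of order one, there exists, according
to Definition 7.1, a causal representation for $\{\Delta X_t\}$:

$$\Delta X_t = \delta + \Psi(L)Z_t \quad \text{with } Z_t \sim \mathrm{WN}(0, \sigma^2)$$

with the property $\Psi(1) \ne 0$ and $\sum_{j=0}^{\infty} j|\psi_j| < \infty$. Before proceeding to the
main theorem, we notice the following simple, but extremely useful polynomial
decomposition of $\Psi(L)$:

$$\begin{aligned}
\Psi(L) - \Psi(1) &= 1 + \psi_1 L + \psi_2 L^2 + \psi_3 L^3 + \psi_4 L^4 + \dots \\
&\quad - 1 - \psi_1 - \psi_2 - \psi_3 - \psi_4 - \dots \\
&= \psi_1 (L - 1) + \psi_2 (L^2 - 1) + \psi_3 (L^3 - 1) + \psi_4 (L^4 - 1) + \dots \\
&= (L - 1)\left[\psi_1 + \psi_2 (L + 1) + \psi_3 (L^2 + L + 1) + \dots\right] \\
&= (L - 1)\left[(\psi_1 + \psi_2 + \psi_3 + \dots) + (\psi_2 + \psi_3 + \psi_4 + \dots)L \right. \\
&\qquad\qquad \left. + (\psi_3 + \psi_4 + \psi_5 + \dots)L^2 + \dots\right].
\end{aligned}$$

We state this result in the following Lemma:

Lemma 7.1. Let $\Psi(L) = \sum_{j=0}^{\infty} \psi_j L^j$, then

$$\Psi(L) = \Psi(1) + (L - 1)\widetilde{\Psi}(L)$$

where $\widetilde{\Psi}(L) = \sum_{j=0}^{\infty} \widetilde{\psi}_j L^j$ with $\widetilde{\psi}_j = \sum_{i=j+1}^{\infty} \psi_i$.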

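4 Neusser (2000) shows how a Beveridge-Nelson decomposition can also be derived for higher
order integrated processes.

As $\{X_t\}$ is integrated and because $\Psi(1) \ne 0$ we can express $X_t$ as follows:

$$\begin{aligned}
X_t &= X_0 + \sum_{j=1}^{t} \Delta X_j \\
&= X_0 + \sum_{j=1}^{t} \left\{\delta + \left[\Psi(1) + (L - 1)\widetilde{\Psi}(L)\right] Z_j\right\} \\
&= X_0 + \delta t + \Psi(1)\sum_{j=1}^{t} Z_j + \sum_{j=1}^{t} (L - 1)\widetilde{\Psi}(L) Z_j \\
&= \underbrace{X_0 + \delta t}_{\text{linear trend}} + \underbrace{\Psi(1)\sum_{j=1}^{t} Z_j}_{\text{random walk}} + \underbrace{\widetilde{\Psi}(L)Z_0 - \widetilde{\Psi}(L)Z_t}_{\text{stationary component}}.
\end{aligned}$$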

This  leads  to  the  following  theorem. 

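Theorem 7.1 (Beveridge-Nelson Decomposition). Every integrated process $\{X_t\}$
has a decomposition of the following form:

$$X_t = \underbrace{X_0 + \delta t}_{\text{linear trend}} + \underbrace{\Psi(1)\sum_{j=1}^{t} Z_j}_{\text{random walk}} + \underbrace{\widetilde{\Psi}(L)Z_0 - \widetilde{\Psi}(L)Z_t}_{\text{stationary component}}.$$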

The  above  representation  is  referred  to  as  the  Beveridge-Nelson  decomposition. 

Proof. The only substantial issue is to show that $\widetilde{\Psi}(L)Z_0 - \widetilde{\Psi}(L)Z_t$ defines a
stationary process. According to Theorem 6.4 it is sufficient to show that the
coefficients of $\widetilde{\Psi}(L)$ are absolutely summable. We have that:

$$\sum_{j=0}^{\infty} |\widetilde{\psi}_j| = \sum_{j=0}^{\infty} \left|\sum_{i=j+1}^{\infty} \psi_i\right| \le \sum_{j=0}^{\infty} \sum_{i=j+1}^{\infty} |\psi_i| = \sum_{j=1}^{\infty} j|\psi_j| < \infty,$$

where the first inequality is a consequence of the triangle inequality and the
second inequality follows from Definition 7.1 of an integrated process. □
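The decomposition can be made concrete for the ARIMA(0,1,1) process of Example 2, for which $\Psi(1) = 1 + \theta$ and $\widetilde{\Psi}(L) = \theta$. The following sketch simulates such a process and verifies that the three components add up to the simulated series. The parameter values, the seed, and the Gaussian shocks are assumptions chosen only for illustration.

```python
import random

random.seed(0)
delta, theta, T = 0.2, 0.5, 200                   # hypothetical parameter values
z = [random.gauss(0.0, 1.0) for _ in range(T + 1)]    # z[0] plays the role of Z_0

# Simulate X_t from Delta X_t = delta + Z_t + theta * Z_{t-1}, starting at X_0 = 0.
x = [0.0]
for t in range(1, T + 1):
    x.append(x[-1] + delta + z[t] + theta * z[t - 1])

# Beveridge-Nelson components: Psi(1) = 1 + theta and Psi~(L) = theta.
persistence = 1.0 + theta
linear_trend = [x[0] + delta * t for t in range(T + 1)]
random_walk = [0.0]
for t in range(1, T + 1):
    random_walk.append(random_walk[-1] + persistence * z[t])
stationary = [theta * z[0] - theta * z[t] for t in range(T + 1)]

reconstructed = [a + b + c for a, b, c in zip(linear_trend, random_walk, stationary)]
max_gap = max(abs(a - b) for a, b in zip(x, reconstructed))
print(max_gap)
```

The maximal gap between the simulated series and the sum of its components should be zero up to rounding error.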


Shocks to the random walk component have a permanent effect. This effect is
measured by the persistence $\Psi(1)$, the coefficient of the random walk component.
In macroeconomics aggregate supply shocks are thought to have a long-run effect
as they affect productivity. In contrast, monetary or demand shocks are viewed as
having temporary effects only. Thus the persistence $\Psi(1)$ can be interpreted as a
measure of the importance of supply shocks (see Campbell and Mankiw (1987),
Cochrane (1988) or Christiano and Eichenbaum (1990)). For a critical view from an
econometric standpoint see Hauser et al. (1999). A more sophisticated multivariate
approach to identify supply and demand shocks and to disentangle their relative
importance is provided in Sect. 15.5.

In business cycle analysis it is often useful to decompose $\{X_t\}$ into a sum of a
trend component $\mu_t$ and a cyclical component $\varepsilon_t$:

$$X_t = \mu_t + \varepsilon_t.$$

In the case of a difference-stationary series, the cyclical component can be identified
with the stationary component in the Beveridge-Nelson decomposition and the trend
component with the random walk plus the linear trend. Suppose that $\{\Delta X_t\}$ follows
an ARMA process $\Phi(L)\Delta X_t = c + \Theta(L)Z_t$; then $\Delta\mu_t = \delta + \Psi(1)Z_t$ can be identified
as the trend component. This means that the trend component can be recursively
determined from the observations by applying the formula

$$\mu_t = \frac{\Phi(L)}{\Theta(L)}\,\Psi(1)\,X_t.$$

The cyclical component is then simply the residual: $\varepsilon_t = X_t - \mu_t$.

In  the  above  decomposition  both  the  permanent  (trend)  component  as  well  as 
the  stationary  (cyclical)  component  are  driven  by  the  same  shock  Z,.  A  more 
sophisticated  model  would,  however,  allow  that  the  two  components  are  driven 
by  different  shocks.  This  idea  is  exploited  in  the  so-called  structural  time  series 
analysis  where  the  different  components  (trend,  cycle,  season,  and  irregular)  are 
modeled as being driven by separate shocks. As only the series $\{X_t\}$ is observed,
not  its  components,  this  approach  leads  to  serious  identification  problems.  See 
the  discussion  in  Harvey  (1989),  Hannan  and  Deistler  (1988),  or  Mills  (2003).  In 
Sects.  17.1  and  17.4.2  we  will  provide  an  overall  framework  to  deal  with  these 
issues. 


Examples 

Let $\{\Delta X_t\}$ be an MA(q) process with $\Delta X_t = \delta + Z_t + \theta_1 Z_{t-1} + \dots + \theta_q Z_{t-q}$; then the persistence
is given simply by the sum of the MA coefficients: $\Psi(1) = 1 + \theta_1 + \dots + \theta_q$.
Depending on the values of these coefficients, the persistence can be smaller or
greater than one.

If $\{\Delta X_t\}$ is an AR(1) process with $\Delta X_t = \delta + \phi \Delta X_{t-1} + Z_t$ and $|\phi| < 1$,
then we get: $\Delta X_t = \frac{\delta}{1 - \phi} + \sum_{j=0}^{\infty} \phi^j Z_{t-j}$. The persistence is then given as
$\Psi(1) = \sum_{j=0}^{\infty} \phi^j = \frac{1}{1 - \phi}$. For positive values of $\phi$, the persistence is greater than one. Thus,
a shock of one is amplified to have an effect larger than one in the long run.

If $\{\Delta X_t\}$ is assumed to be an ARMA(1,1) process with $\Delta X_t = \delta + \phi \Delta X_{t-1} + Z_t + \theta Z_{t-1}$
and $|\phi| < 1$, then $\Delta X_t = \frac{\delta}{1 - \phi} + Z_t + (\phi + \theta) \sum_{j=0}^{\infty} \phi^j Z_{t-j-1}$. The persistence
is therefore given by $\Psi(1) = 1 + (\phi + \theta) \sum_{j=0}^{\infty} \phi^j = \frac{1 + \theta}{1 - \phi}$.

The computation of the persistence for the model estimated for Swiss GDP in
Sect. 5.6 is more complicated because a fourth-order difference $1 - L^4$ has been
used instead of a first-order one. As $1 - L^4 = (1 - L)(1 + L + L^2 + L^3)$, it is possible
to extend the above computations also to this case. For this purpose we compute the
persistence for $(1 + L + L^2 + L^3) \ln \mathrm{BIP}_t$ in the usual way. The long-run effect on
$\ln \mathrm{BIP}_t$ is therefore given by $\Psi(1)/4$ because $(1 + L + L^2 + L^3) \ln \mathrm{BIP}_t$ is nothing
but four times the moving average of the last four values. For the AR(2) model we
get a persistence of 1.42 whereas for the ARMA(1,3) model the persistence is 1.34.
Both values are definitely above one so that the permanent effect of a one-percent
shock to Swiss GDP is amplified to be larger than one in the long run. Campbell
and Mankiw (1987) and Cochrane (1988) report similar values for the US.
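The closed-form expressions of the first three examples can be compared with a direct summation of the weights $\psi_j$. The sketch below assumes the sign convention $\Phi(L) = 1 - \phi_1 L - \dots$ and $\Theta(L) = 1 + \theta_1 L + \dots$; the function names and the coefficient values are hypothetical.

```python
def psi_weights(phi, theta, n):
    """First n MA(infinity) weights of Phi(L) Y_t = Theta(L) Z_t.

    phi = [phi_1, ..., phi_p] and theta = [theta_1, ..., theta_q], using the
    convention Phi(L) = 1 - phi_1 L - ... and Theta(L) = 1 + theta_1 L + ...
    """
    psi = []
    for j in range(n):
        value = 1.0 if j == 0 else (theta[j - 1] if j <= len(theta) else 0.0)
        value += sum(phi[i] * psi[j - 1 - i] for i in range(min(len(phi), j)))
        psi.append(value)
    return psi

def persistence(phi, theta):
    """Closed form Psi(1) = Theta(1) / Phi(1)."""
    return (1.0 + sum(theta)) / (1.0 - sum(phi))

# MA(2), AR(1) and ARMA(1,1) specifications for the differenced series.
for phi, theta in (([], [0.4, 0.2]), ([0.5], []), ([0.5], [0.3])):
    approx = sum(psi_weights(phi, theta, 200))
    print(phi, theta, round(approx, 6), round(persistence(phi, theta), 6))
```

For each specification the truncated sum of the weights should agree with $\Theta(1)/\Phi(1)$, because the weights of a causal ARMA process decay exponentially.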


7.2  Properties  of  the  OLS  Estimator  in  the  Case 
of  Integrated  Variables 

The estimation and testing of coefficients of models involving integrated variables
is not without complications and traps because the usual asymptotic theory may
become invalid. The reason is that the asymptotic distributions are in general
no longer normal so that the usual critical values for the test statistics are no longer
valid. A general treatment of these issues is beyond this text, but can be found in
Banerjee et al. (1993) and Stock (1994). We may, however, illustrate the kind of
problems encountered by looking at the Gaussian AR(1) case5:

$$X_t = \phi X_{t-1} + Z_t, \qquad t = 1, 2, \dots,$$

where $Z_t \sim \mathrm{IID}\ N(0, \sigma^2)$ and $X_0 = 0$. For observations on $X_1, X_2, \dots, X_T$ the OLS
estimator of $\phi$ is given by the usual expression:

$$\hat{\phi}_T = \frac{\sum_{t=1}^{T} X_{t-1} X_t}{\sum_{t=1}^{T} X_{t-1}^2} = \phi + \frac{\sum_{t=1}^{T} X_{t-1} Z_t}{\sum_{t=1}^{T} X_{t-1}^2}.$$

For $|\phi| < 1$, the OLS estimator of $\phi$, $\hat{\phi}_T$, converges in distribution to a normal
random variable (see Chap. 5 and in particular Sect. 5.2):

$$\sqrt{T}\left(\hat{\phi}_T - \phi\right) \xrightarrow{d} N\left(0, 1 - \phi^2\right).$$

The estimated density of the OLS estimator of $\phi$ for different values of $\phi$ is
represented in Fig. 7.1. This figure was constructed using a Monte-Carlo simulation
of the above model for a sample size of $T = 100$ using 10,000 replications for
each value of $\phi$.6 The figure shows that the distribution of $\hat{\phi}_T$ becomes more and
more concentrated if the true value of $\phi$ gets closer and closer to one. Moreover
the distribution also gets more and more skewed to the left. This implies that the
OLS estimator is downward biased and that this bias gets relatively more and more
pronounced in small samples as $\phi$ approaches one.

5 We will treat more general cases in Sect. 7.5 and Chap. 16.

Fig. 7.1 Distribution of the OLS estimator of $\phi$ for $T = 100$ and 10,000 replications

The asymptotic distribution would be degenerate for $\phi = 1$ because the
variance approaches zero as $\phi$ goes to one. Thus the asymptotic distribution
becomes useless for statistical inference under this circumstance. In order to obtain
a non-degenerate distribution the estimator must be scaled by $T$ instead of $\sqrt{T}$. It
can be shown that

$$T\left(\hat{\phi}_T - \phi\right) \xrightarrow{d} \nu.$$

This result was first established by Dickey and Fuller (1976) and Dickey and Fuller
(1981). However, the asymptotic distribution $\nu$ need no longer be normal. It was first
tabulated in Fuller (1976). The scaling with $T$ instead of $\sqrt{T}$ means that the OLS
estimator converges, if the true value of $\phi$ equals one, at a higher rate to $\phi = 1$. This
property is known as superconsistency.
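The downward bias of the OLS estimator can be reproduced with a small Monte-Carlo experiment along the lines described above. The sketch below is not the code behind Fig. 7.1: it uses fewer replications (2,000 instead of 10,000) to keep the pure-Python run time short, and the function names and seeds are assumptions.

```python
import random

def ols_ar1(x):
    """OLS estimate of phi in X_t = phi X_{t-1} + Z_t (no constant), x[0] = X_0."""
    numerator = sum(x[t - 1] * x[t] for t in range(1, len(x)))
    denominator = sum(x[t - 1] ** 2 for t in range(1, len(x)))
    return numerator / denominator

def simulate_estimates(phi, T=100, replications=2000, seed=1):
    """Simulated OLS estimates for a Gaussian AR(1) started at X_0 = 0."""
    rng = random.Random(seed)
    estimates = []
    for _ in range(replications):
        x = [0.0]
        for _ in range(T):
            x.append(phi * x[-1] + rng.gauss(0.0, 1.0))
        estimates.append(ols_ar1(x))
    return estimates

for phi in (0.5, 0.9, 1.0):
    est = simulate_estimates(phi)
    mean = sum(est) / len(est)
    share_below = sum(e < phi for e in est) / len(est)
    print(phi, round(mean, 3), round(share_below, 3))
```

For $\phi = 1$ the simulated mean of $\hat{\phi}_T$ should lie below one, and clearly more than half of the estimates should fall below the true value, which reflects the left-skewed distribution described in the text.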


6The  densities  were  estimated  using  an  adaptive  kernel  density  estimator  with  Epanechnikov 
window  (see  Silverman  (1986)). 




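In order to understand this result better, in particular in the light of the derivation
in the Appendix of Sect. 5.2, we take a closer look at the asymptotic distribution of
$T(\hat{\phi}_T - \phi)$:

$$T(\hat{\phi}_T - \phi) = \frac{\frac{1}{\sigma^2 T}\sum_{t=1}^{T} X_{t-1} Z_t}{\frac{1}{\sigma^2 T^2}\sum_{t=1}^{T} X_{t-1}^2}.$$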


Under the assumption $\phi = 1$, $X_t$ becomes a random walk so that $X_t$ can be written
as $X_t = Z_t + \dots + Z_1$. Moreover, as a sum of normally distributed random variables
$X_t$ becomes itself normally distributed as $X_t \sim N(0, \sigma^2 t)$. In addition, we get


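$$X_t^2 = (X_{t-1} + Z_t)^2 = X_{t-1}^2 + 2 X_{t-1} Z_t + Z_t^2$$

$$\Rightarrow\quad X_{t-1} Z_t = \left(X_t^2 - X_{t-1}^2 - Z_t^2\right)/2$$

$$\Rightarrow\quad \sum_{t=1}^{T} X_{t-1} Z_t = \frac{X_T^2 - X_0^2}{2} - \frac{\sum_{t=1}^{T} Z_t^2}{2}$$

$$\Rightarrow\quad \frac{1}{T}\sum_{t=1}^{T} X_{t-1} Z_t = \frac{1}{2}\left[\frac{X_T^2}{T} - \frac{\sum_{t=1}^{T} Z_t^2}{T}\right] \qquad (X_0 = 0)$$

$$\Rightarrow\quad \frac{1}{\sigma^2 T}\sum_{t=1}^{T} X_{t-1} Z_t = \frac{1}{2}\left(\frac{X_T}{\sigma\sqrt{T}}\right)^2 - \frac{1}{2\sigma^2}\,\frac{\sum_{t=1}^{T} Z_t^2}{T} \xrightarrow{\;d\;} \frac{1}{2}\left(\chi_1^2 - 1\right).$$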


The numerator therefore converges to a (shifted and scaled) $\chi^2$ distribution. The distribution of the
denominator is more involved, but its expected value is given by:


$$E\sum_{t=1}^{T} X_{t-1}^2 = \sigma^2 \sum_{t=1}^{T} (t-1) = \frac{\sigma^2 T (T-1)}{2}$$


because $X_{t-1} \sim N(0, \sigma^2 (t-1))$. To obtain a nondegenerate random variable one
must scale by $T^2$. Thus, intuitively, $T(\hat{\phi}_T - \phi)$ will no longer converge to a degenerate
distribution.
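The two ingredients of this argument can be checked numerically. The sketch below assumes a Gaussian random walk with $X_0 = 0$; the seed, $\sigma = 2$, and the number of replications are arbitrary illustrative choices. It verifies the telescoping identity $\sum X_{t-1} Z_t = (X_T^2 - \sum Z_t^2)/2$ on one path and compares the simulated mean of $\sum X_{t-1}^2$ with $\sigma^2 T(T-1)/2$.

```python
import random

def random_walk_sums(T, sigma, rng):
    """Return (sum X_{t-1} Z_t, X_T, sum Z_t^2, sum X_{t-1}^2) for a random walk with X_0 = 0."""
    x_prev = 0.0
    cross = sum_z2 = sum_x2 = 0.0
    for _ in range(T):
        z = rng.gauss(0.0, sigma)
        cross += x_prev * z          # X_{t-1} Z_t
        sum_x2 += x_prev * x_prev    # X_{t-1}^2
        sum_z2 += z * z
        x_prev += z                  # X_t = X_{t-1} + Z_t
    return cross, x_prev, sum_z2, sum_x2

rng = random.Random(1)
T, sigma = 100, 2.0

# Numerator identity on a single path
cross, x_T, sum_z2, _ = random_walk_sums(T, sigma, rng)
identity_gap = abs(cross - 0.5 * (x_T ** 2 - sum_z2))

# Expected value of the denominator, approximated by an average over many paths
reps = 5000
avg_den = sum(random_walk_sums(T, sigma, rng)[3] for _ in range(reps)) / reps
expected_den = sigma ** 2 * T * (T - 1) / 2

print(identity_gap, avg_den, expected_den)
```

The identity holds up to floating-point error for every path, whereas the second comparison only holds approximately because it relies on simulation.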

Using similar arguments it can be shown that the t-statistic

$$t_{\hat{\phi}} = \frac{\hat{\phi}_T - 1}{\hat{\sigma}_{\hat{\phi}}} = \frac{\hat{\phi}_T - 1}{\sqrt{s_T^2 \Big/ \sum_{t=1}^{T} X_{t-1}^2}}$$

with $s_T^2 = \frac{1}{T-2}\sum_{t=2}^{T} \left(X_t - \hat{\phi}_T X_{t-1}\right)^2$ is not asymptotically normal. Its distribution
was also first tabulated by Fuller (1976). Figure 7.2 compares its density with the
standard normal distribution in a Monte-Carlo experiment using again a sample of

Fig.  7.2  Distribution  of  t-statistic  for  T  =  100  and  10,000  replications  and  standard  normal 
distribution 


T = 100 and 10,000 replications. It is obvious that the distribution of the t-statistic is shifted to
the left. This implies that the critical values will be larger in absolute value than for the
standard case. In addition, one may observe a slight skewness.
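The leftward shift can be reproduced with a simulation under the null hypothesis. This is a minimal sketch assuming a Gaussian random walk with $X_0 = 0$ and a regression without deterministic terms; the helper names, the seed, and the 5,000 replications are illustrative choices. The residual variance uses the number of observations minus one as divisor.

```python
import math
import random

def df_t_statistic(x):
    """t-statistic for H0: phi = 1 in the regression X_t = phi X_{t-1} + Z_t (no constant)."""
    lagged, current = x[:-1], x[1:]
    sxx = sum(a * a for a in lagged)
    phi_hat = sum(a * b for a, b in zip(lagged, current)) / sxx
    n = len(current)
    s2 = sum((b - phi_hat * a) ** 2 for a, b in zip(lagged, current)) / (n - 1)
    return (phi_hat - 1.0) / math.sqrt(s2 / sxx)

def simulate_null(T=100, reps=5000, level=0.05, seed=42):
    """Return the empirical `level` quantile and the mean of the t-statistic under a unit root."""
    rng = random.Random(seed)
    stats = []
    for _ in range(reps):
        x, path = 0.0, [0.0]
        for _step in range(T):
            x += rng.gauss(0.0, 1.0)
            path.append(x)
        stats.append(df_t_statistic(path))
    stats.sort()
    return stats[int(level * reps)], sum(stats) / reps

quantile_5, mean_t = simulate_null()
print(f"5% quantile: {quantile_5:.2f}, mean: {mean_t:.2f}")
```

For comparison, the 5 % quantile of the standard normal distribution is about -1.645; the simulated quantile should be clearly more negative and the mean of the statistic should be below zero.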

Finally,  we  also  want  to  investigate  the  autocovariance  function  of  a  random 
walk.  Using  similar  arguments  as  in  Sect.  1.3  we  get: 


$$\gamma(h) = E(X_T X_{T-h})$$

$$= E\left[(Z_T + Z_{T-1} + \dots + Z_1)(Z_{T-h} + Z_{T-h-1} + \dots + Z_1)\right]$$

$$= E\left(Z_{T-h}^2 + Z_{T-h-1}^2 + \dots + Z_1^2\right) = (T-h)\,\sigma^2.$$

Thus the correlation coefficient between $X_T$ and $X_{T-h}$ is:

$$\rho(h) = \frac{\gamma(h)}{\sqrt{V X_T}\,\sqrt{V X_{T-h}}} = \frac{T-h}{\sqrt{T (T-h)}} = \sqrt{\frac{T-h}{T}}, \qquad h < T.$$

The autocorrelation coefficient $\rho(h)$ therefore decreases monotonically with $h$,
holding the sample size T constant. The rate at which $\rho(h)$ falls is, however, smaller
than for ARMA processes for which $\rho(h)$ declines exponentially fast to zero. Given
$h$, the autocorrelation coefficient converges to one for $T \to \infty$. Figure 7.3 compares
the theoretical and the estimated ACF of a simulated random walk with T = 100.
Typically, the estimated coefficients lie below the theoretical ones. In addition, we
show the estimated ACF for an AR(1) process with $\phi = 0.9$ and using the same
realizations of the white noise process as in the construction of the random walk.
Despite the large differences between the ACF of an AR(1) process and a random
walk, the ACF is only of limited use to discriminate between a (stationary) ARMA
process and a random walk.

Fig. 7.3 ACF of a random walk with 100 observations

The above calculation also shows that $\rho(1) < 1$ so that the expected value of the
OLS estimator is downward biased in finite samples: $E\hat{\phi}_T < 1$.
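To get a feeling for the orders of magnitude, the theoretical autocorrelations of a random walk, $\sqrt{(T-h)/T}$, can be tabulated next to those of a stationary AR(1) process, $\phi^h$. The short sketch below uses T = 100 and $\phi = 0.9$ as in the text; the function names and the chosen lags are illustrative.

```python
import math

def rho_random_walk(h, T):
    """Correlation between X_T and X_{T-h} for a random walk started at X_0 = 0."""
    if not 0 <= h < T:
        raise ValueError("need 0 <= h < T")
    return math.sqrt((T - h) / T)

def rho_ar1(h, phi):
    """Autocorrelation at lag h of a stationary AR(1) process with coefficient phi."""
    return phi ** h

T = 100
for h in (1, 5, 10, 20):
    print(h, round(rho_random_walk(h, T), 3), round(rho_ar1(h, 0.9), 3))
```

The random-walk values decline only slowly with h, while the AR(1) values decay geometrically.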


7.3  Unit-Root  Tests 

The  previous  Sects.  7.1  and  7.2  have  shown  that,  depending  on  the  nature  of  the 
non-stationarity  (trend  versus  difference  stationarity),  the  stochastic  process  has 
quite  different  algebraic  (forecast,  forecast  error  variance,  persistence)  and  statistical 
(asymptotic distribution of the OLS estimator) properties. It is therefore important to be
able to discriminate between these two types of processes. This also pertains
to  standard  regression  models  for  which  the  presence  of  integrated  variables  can  lead 
to  non-normal  asymptotic  distributions. 

The ability to differentiate between trend- and difference-stationary processes is
not only important from a statistical point of view, but can be given an economic
interpretation. In macroeconomic theory, monetary and demand disturbances are
alleged to have only temporary effects whereas supply disturbances, in particular
technology shocks, are supposed to have permanent effects. To put it in the language
of time series analysis: monetary and demand shocks have a persistence of zero
whereas supply shocks have nonzero (positive) persistence. Nelson and Plosser
(1982) were the first to investigate the trend properties of economic time series
from this angle. In their influential study they reached the conclusion that, with the
important exception of the unemployment rate, most economic time series in the
US are better characterized as being difference stationary. Although this conclusion
came under severe scrutiny (see Cochrane (1988) and Campbell and Perron (1991)),
this issue resurfaces in many economic debates. The latest discussion relates to the
nature and effect of technology shocks (see Galí (1999) or Christiano et al. (2003)).

The  following  exposition  focuses  on  the  Dickey-Fuller  test  (DF-test)  and  the 
Phillips-Perron test (PP-test). Although other test procedures and variants thereof
have  been  developed  in  the  meantime,  these  two  remain  the  most  widely  applied  in 
practice.  These  types  of  tests  are  also  called  unit-root  tests. 

Both the DF- as well as the PP-test rely on a regression of $X_t$ on $X_{t-1}$ which may
include further deterministic regressors like a constant or a linear time trend. We
call this regression the Dickey-Fuller regression:

$$X_t = \text{deterministic variables} + \phi X_{t-1} + Z_t. \qquad (7.1)$$


Alternatively  and  numerically  equivalent,  one  may  run  the  Dickey-Fuller  regression 
in  difference  form: 


$$\Delta X_t = \text{deterministic variables} + \beta X_{t-1} + Z_t$$

with $\beta = \phi - 1$. For both tests, the null hypothesis is that the process is integrated
of order one, difference stationary, or has a unit root. Thus we have


$$H_0:\ \phi = 1 \quad\text{or}\quad \beta = 0.$$


The alternative hypothesis $H_1$ is that the process is trend-stationary or stationary
with constant mean and is given by:


$$H_1:\ -1 < \phi < 1 \quad\text{or}\quad -2 < \beta = \phi - 1 < 0.$$

Thus  the  unit  root  test  is  a  one-sided  test.  The  advantage  of  the  second  formulation 
of  the  Dickey-Fuller  regression  is  that  the  corresponding  t-statistic  can  be  readily 
read  off  from  standard  outputs  of  many  computer  packages  which  makes  additional 
computations  unnecessary. 
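The numerical equivalence of the two formulations is easy to verify. The sketch below does so for the regression without deterministic terms on a simulated AR(1) series; the data-generating parameters, the seed, and the helper `no_constant_ols` are illustrative assumptions.

```python
import math
import random

def no_constant_ols(y, x):
    """OLS of y on x without constant; returns (coefficient, standard error)."""
    sxx = sum(a * a for a in x)
    coef = sum(a * b for a, b in zip(x, y)) / sxx
    resid_ss = sum((b - coef * a) ** 2 for a, b in zip(x, y))
    s2 = resid_ss / (len(y) - 1)
    return coef, math.sqrt(s2 / sxx)

# Simulated AR(1) series with phi = 0.95 (assumed values)
rng = random.Random(7)
x = [0.0]
for _ in range(200):
    x.append(0.95 * x[-1] + rng.gauss(0.0, 1.0))

lagged = x[:-1]
levels = x[1:]
diffs = [b - a for a, b in zip(lagged, levels)]

phi_hat, se_level = no_constant_ols(levels, lagged)   # X_t on X_{t-1}
beta_hat, se_diff = no_constant_ols(diffs, lagged)    # Delta X_t on X_{t-1}

t_level = (phi_hat - 1.0) / se_level   # t-statistic for H0: phi = 1
t_diff = beta_hat / se_diff            # t-statistic for H0: beta = 0
print(t_level, t_diff)
```

Both regressions have identical residuals, so the standard errors and the two t-statistics coincide; in the difference form the statistic is simply the usual t-ratio of the coefficient on $X_{t-1}$.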




7.3.1  The  Dickey-Fuller  Test  (DF-Test) 

The Dickey-Fuller test comes in two forms. The first one, sometimes called the $\rho$-test,
takes $T(\hat{\phi}_T - 1)$ as the test statistic. As shown previously, this statistic is no
longer asymptotically normally distributed. However, it was first tabulated by Fuller
and can be found in textbooks like Fuller (1976) or Hamilton (1994b). The second
and much more common one relies on the usual t-statistic for the hypothesis $\phi = 1$:

$$t_{\hat{\phi}} = (\hat{\phi}_T - 1)/\hat{\sigma}_{\hat{\phi}}.$$

This test statistic is also not asymptotically normally distributed. It was first
tabulated by Fuller (1976) and can be found, for example, in Hamilton (1994b).
Later MacKinnon (1991) presented much more detailed tables where the critical
values can be approximated for any sample size T by using interpolation formulas
(see also Banerjee et al. (1993)).7

The application of the Dickey-Fuller test as well as the Phillips-Perron test is
complicated by the fact that the asymptotic distribution of the test statistic ($\rho$- or t-test)
depends on the specification of the deterministic components and on the true
data generating process. This implies that, depending on whether the Dickey-Fuller
regression includes, for example, a constant and/or a time trend and on the nature
of the true data generating process, one has to use different tables and thus different
critical values. In the following we will focus on the most common cases listed in
Table 7.1.

In case 1 the Dickey-Fuller regression includes no deterministic component.
Thus, a rejection of the null hypothesis implies that $\{X_t\}$ has to be a mean zero
stationary process. This specification is, therefore, only warranted if one can make
sure that the data have indeed mean zero. As this is rarely the case, except, for
example, when the data are the residuals from a previous regression,8 case 1 is


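Table 7.1 The four most important cases for the unit-root test

| Data generating process (null hypothesis) | Estimated regression (Dickey-Fuller regression) | $\rho$-test: $T(\hat{\phi}_T - 1)$ | t-test |
|---|---|---|---|
| $X_t = X_{t-1} + Z_t$ | $X_t = \phi X_{t-1} + Z_t$ | Case 1 | Case 1 |
| $X_t = X_{t-1} + Z_t$ | $X_t = \alpha + \phi X_{t-1} + Z_t$ | Case 2 | Case 2 |
| $X_t = \alpha + X_{t-1} + Z_t$, $\alpha \neq 0$ | $X_t = \alpha + \phi X_{t-1} + Z_t$ | | N(0,1) |
| $X_t = \alpha + X_{t-1} + Z_t$ | $X_t = \alpha + \delta t + \phi X_{t-1} + Z_t$ | Case 4 | Case 4 |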

7These interpolation formulas are now implemented in many software packages, like EVIEWS, to
compute the appropriate critical values.

8This fact may pose a problem by itself.

very uncommon in practice. Thus, if the data do not display a trend, which can
be checked by a simple time plot, the Dickey-Fuller regression should include a
constant. A rejection of the null hypothesis then implies that $\{X_t\}$ is a stationary
process with mean $\mu = \frac{\alpha}{1-\phi}$. If the data display a time trend, the Dickey-Fuller
regression should also include a linear time trend as in case 4. A rejection of the
null hypothesis then implies that the process is trend-stationary. In the case that the
Dickey-Fuller regression contains no time trend and there is no time trend under the
alternative hypothesis, asymptotic normality holds. This case is only of theoretical
interest as it should a priori be clear whether the data are trending or not. In the
instance where one is not confident about the trending nature of the time series see
the procedure outlined in Sect. 7.3.3.

In cases 2 and 4 it is of interest to investigate the joint hypotheses $H_0: \alpha = 0$
and $\phi = 1$, and $H_0: \delta = 0$ and $\phi = 1$, respectively. Again the corresponding
F-statistic is no longer F-distributed, but has been tabulated (see Hamilton (1994b,
Table B.7)). The trade-off between t- and F-test is discussed in Sect. 7.3.3.

Most economic time series display a significant amount of autocorrelation.
To take this feature into account it is necessary to include lagged differences
$\Delta X_{t-1}, \dots, \Delta X_{t-p+1}$ as additional regressors. The so modified Dickey-Fuller
regression then becomes:

$$X_t = \text{deterministic variables} + \phi X_{t-1} + \gamma_1 \Delta X_{t-1} + \dots + \gamma_{p-1} \Delta X_{t-p+1} + Z_t.$$


This  modified  test  is  called  the  augmented  Dickey-Fuller  test  (ADF-test).  This 
autoregressive  correction  does  not  change  the  asymptotic  distribution  of  the  test 
statistics.  Thus  the  same  tables  can  be  used  as  before.  For  the  coefficients  of  the 
autoregressive  terms  asymptotic  normality  holds.  This  implies  that  the  standard 
testing procedures (t-test, F-test) can be applied in the usual way. This remains true if
moving-average terms are used instead of autoregressive correction terms
(see Said and Dickey (1984)).

For the ADF-test the order p of the model should be chosen such that the residuals
are close to being white noise. This can be checked, for example, by looking at
the ACF of the residuals or by carrying out a Ljung-Box test (see Sect. 4.2). In
case of doubt, it is better to choose a higher order. One way to select
the order is to use Akaike's information criterion (AIC). An alternative strategy
advocated by Ng and Perron (1995) is an iterative testing procedure which makes
use of the asymptotic normality of the coefficients of the autoregressive correction terms. Starting
from a maximal order $p - 1 = p_{\max}$, the method tests whether the
coefficient corresponding to the highest order is significantly different from zero. If
the null hypothesis that the coefficient is zero is not rejected, the order of the model
is reduced by one and the test is repeated. This is done as long as the null hypothesis
is not rejected. Once the null hypothesis is rejected, one sticks with the current model
and performs the ADF-test. The successive tests are standard t-tests. It is advisable to
use a rather high significance level, for example a 10 % level. The simulation results
by Ng and Perron (1995) show that this procedure leads to a smaller bias compared
to using the AIC and that the reduction in power remains negligible.
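The general-to-specific rule can be sketched in a few lines. The code below is an illustration under several assumptions, not the procedure of any particular package: the regression is written in difference form with a constant, all candidate models are estimated on the same sample, the two-sided 10 % normal critical value 1.645 is used, and the helpers `ols`, `adf_design`, `select_lags` and `adf_t` as well as the simulated data are hypothetical.

```python
import math
import random

def ols(y, X):
    """OLS via the normal equations; returns (coefficients, standard errors)."""
    n, k = len(y), len(X[0])
    # augmented matrix [X'X | I] for Gauss-Jordan inversion
    A = [[sum(r[i] * r[j] for r in X) for j in range(k)] +
         [1.0 if i == j else 0.0 for j in range(k)] for i in range(k)]
    for c in range(k):
        p = max(range(c, k), key=lambda r: abs(A[r][c]))   # partial pivoting
        A[c], A[p] = A[p], A[c]
        piv = A[c][c]
        A[c] = [v / piv for v in A[c]]
        for r in range(k):
            if r != c:
                f = A[r][c]
                A[r] = [vr - f * vc for vr, vc in zip(A[r], A[c])]
    inv = [row[k:] for row in A]
    xty = [sum(r[i] * yi for r, yi in zip(X, y)) for i in range(k)]
    beta = [sum(inv[i][j] * xty[j] for j in range(k)) for i in range(k)]
    rss = sum((yi - sum(b * v for b, v in zip(beta, r))) ** 2 for r, yi in zip(X, y))
    s2 = rss / (n - k)
    return beta, [math.sqrt(s2 * inv[i][i]) for i in range(k)]

def adf_design(x, lags, start):
    """Targets dX_t and rows (1, X_{t-1}, dX_{t-1}, ..., dX_{t-lags}) for t >= start."""
    d = [0.0] + [b - a for a, b in zip(x[:-1], x[1:])]   # d[t] = X_t - X_{t-1}
    rows = [[1.0, x[t - 1]] + [d[t - j] for j in range(1, lags + 1)]
            for t in range(start, len(x))]
    return [d[t] for t in range(start, len(x))], rows

def select_lags(x, max_lags, crit=1.645):
    """Drop the highest lagged difference as long as its |t| is below crit."""
    for lags in range(max_lags, 0, -1):
        y, X = adf_design(x, lags, start=max_lags + 1)
        beta, se = ols(y, X)
        if abs(beta[-1] / se[-1]) >= crit:
            return lags
    return 0

def adf_t(x, lags):
    """t-statistic of the coefficient on X_{t-1} (beta = phi - 1) in the ADF regression."""
    y, X = adf_design(x, lags, start=lags + 1)
    beta, se = ols(y, X)
    return beta[1] / se[1]

# Simulated unit-root process whose differences follow an AR(1) with coefficient 0.6
rng = random.Random(5)
x, dx = [0.0], 0.0
for _ in range(300):
    dx = 0.6 * dx + rng.gauss(0.0, 1.0)
    x.append(x[-1] + dx)

chosen = select_lags(x, max_lags=6)
print(chosen, adf_t(x, chosen))
```

The resulting t-statistic still has to be compared with the Dickey-Fuller critical values of the appropriate case, not with normal ones.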
7.3.2  The  Phillips-Perron  Test  (PP-Test)


The  Phillips-Perron  test  represents  a  valid  alternative  to  the  ADF-test.  It  is  based  on 
the  simple  Dickey-Fuller  regression  (without  autoregressive  correction  terms)  and 
corrects  for  autocorrelation  by  modifying  the  OLS-estimate  or  the  corresponding 
value of the t-statistic. The simple Dickey-Fuller regression with a constant
and/or a trend is:

$$X_t = \text{deterministic variables} + \phi X_{t-1} + Z_t,$$


where $\{Z_t\}$ need no longer be a white noise process, but can be any mean zero
stationary process. $\{Z_t\}$ may, for example, be an ARMA process. In principle, the
approach also allows for heteroskedasticity.9

The first step in the Phillips-Perron unit-root test estimates the above appropriately
specified Dickey-Fuller regression. The second step consists in the estimation
of the unconditional variance $\gamma_Z(0)$ and the long-run variance $J$ of the residuals
$\hat{Z}_t$. This can be done using one of the methods described in Sect. 4.4. These two
estimates are then used in a third step to correct the $\rho$- and the t-test statistics. This
correction takes care of the autocorrelation present in the data. Finally,
one can use the so modified test statistics to carry out the unit-root test applying the
same tables for the critical values as before.

In  case  1  where  no  deterministic  components  are  taken  into  account  (see  case  1 
in  Table  7.1)  the  modified  test  statistics  according  to  Phillips  (1987)  are: 


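$$\rho\text{-test:}\quad T(\hat{\phi}_T - 1) - \frac{1}{2}\left(\hat{J}_T - \hat{\gamma}_Z(0)\right)\left(\frac{1}{T^2}\sum_{t=1}^{T} X_{t-1}^2\right)^{-1}$$

$$t\text{-test:}\quad \sqrt{\frac{\hat{\gamma}_Z(0)}{\hat{J}_T}}\; t_{\hat{\phi}} - \frac{1}{2}\left(\hat{J}_T - \hat{\gamma}_Z(0)\right)\left(\frac{\hat{J}_T}{T^2}\sum_{t=1}^{T} X_{t-1}^2\right)^{-1/2}$$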


If $\{Z_t\}$ were white noise so that $J = \gamma_Z(0)$, respectively $\hat{J}_T \approx \hat{\gamma}_Z(0)$, one gets
the ordinary Dickey-Fuller test statistic. Similar formulas can be derived for
cases 2 and 4. As already mentioned, these modifications will not alter the asymptotic
distributions so the same critical values as for the ADF-test can be used.

The main advantage of the Phillips-Perron test is that the non-parametric correction
allows for very general $\{Z_t\}$ processes. The PP-test is particularly appropriate if
$\{Z_t\}$ has some MA-components which can be only poorly approximated by low
order autoregressive terms. Another advantage is that one can avoid the exact
modeling of the process. It has been shown by Monte-Carlo studies that the PP-test
has more power compared to the DF-test, i.e. the PP-test rejects the null hypothesis
more often when it is false, but that, on the other hand, it also has a higher size
distortion, i.e. that it rejects the null hypothesis too often.
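The three steps can be sketched for case 1 as follows. The code assumes the standard form of the Phillips (1987) corrections, namely $T(\hat{\phi}_T - 1) - \frac{1}{2}(\hat{J}_T - \hat{\gamma}_Z(0))(T^{-2}\sum X_{t-1}^2)^{-1}$ for the $\rho$-test and $\sqrt{\hat{\gamma}_Z(0)/\hat{J}_T}\, t_{\hat{\phi}} - \frac{1}{2}(\hat{J}_T - \hat{\gamma}_Z(0))(\hat{J}_T T^{-2}\sum X_{t-1}^2)^{-1/2}$ for the t-test, and estimates the long-run variance with a Bartlett kernel. The function names, the bandwidth, and the simulated series with MA(1) increments are illustrative assumptions.

```python
import math
import random

def autocov(z, h):
    """Sample autocovariance at lag h of a residual series treated as mean zero (divisor T)."""
    return sum(a * b for a, b in zip(z[h:], z[:len(z) - h])) / len(z)

def long_run_variance(z, bandwidth):
    """Bartlett-kernel estimate: gamma(0) + 2 * sum_h (1 - h/(bandwidth+1)) * gamma(h)."""
    return autocov(z, 0) + 2.0 * sum(
        (1.0 - h / (bandwidth + 1.0)) * autocov(z, h) for h in range(1, bandwidth + 1))

def phillips_perron_case1(x, bandwidth):
    """Return (corrected rho-stat, corrected t-stat, plain rho-stat, plain t-stat)."""
    lagged, current = x[:-1], x[1:]
    T = len(current)
    sxx = sum(a * a for a in lagged)
    phi_hat = sum(a * b for a, b in zip(lagged, current)) / sxx      # step 1: OLS
    resid = [b - phi_hat * a for a, b in zip(lagged, current)]
    gamma0 = autocov(resid, 0)                                      # step 2: variances
    J = long_run_variance(resid, bandwidth)
    s2 = sum(e * e for e in resid) / (T - 1)
    t_stat = (phi_hat - 1.0) / math.sqrt(s2 / sxx)
    rho_stat = T * (phi_hat - 1.0)
    scaled_sxx = sxx / T ** 2
    z_rho = rho_stat - 0.5 * (J - gamma0) / scaled_sxx              # step 3: corrections
    z_t = math.sqrt(gamma0 / J) * t_stat - 0.5 * (J - gamma0) / math.sqrt(J * scaled_sxx)
    return z_rho, z_t, rho_stat, t_stat

# Random walk whose increments follow an MA(1) process (assumed example)
rng = random.Random(3)
e = [rng.gauss(0.0, 1.0) for _ in range(201)]
x = [0.0]
for t in range(1, 201):
    x.append(x[-1] + e[t] + 0.5 * e[t - 1])

print(phillips_perron_case1(x, bandwidth=5))
```

With a bandwidth of zero the long-run variance estimate equals $\hat{\gamma}_Z(0)$ and the corrected statistics collapse to the ordinary Dickey-Fuller statistics, which provides a simple consistency check.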


9The exact assumptions can be read in Phillips (1987) and Phillips and Perron (1988).

7.3.3  Unit-Root  Test:  Testing  Strategy

Irrespective of whether the Dickey-Fuller or the Phillips-Perron test is used, the
specification  of  the  deterministic  component  is  important  and  can  pose  a  problem  in 
practice.  On  the  one  hand,  if  the  deterministic  part  is  underrepresented,  for  example 
when  only  a  constant,  but  no  time  trend  is  used,  the  test  results  are  biased  in  favor  of 
the  null  hypothesis,  if  the  data  do  indeed  have  a  trend.  On  the  other  hand,  if  too  many 
deterministic  components  are  used,  the  power  of  the  test  is  reduced.  It  is  therefore 
advisable  to  examine  a  plot  of  the  series  in  order  to  check  whether  a  long  run  trend 
is  visible  or  not.  In  some  circumstances  economic  reasoning  may  help  in  this  regard. 

Sometimes,  however,  it  is  difficult  to  make  an  appropriate  choice  a  priori.  We 
therefore  propose  the  following  testing  strategy  based  on  Elder  and  Kennedy  (2001). 

$X_t$ has a long-run trend: As $X_t$ grows in the long-run, the Dickey-Fuller regression

$$X_t = \alpha + \delta t + \phi X_{t-1} + Z_t$$

should contain a linear trend.10 In this case either $\phi = 1$, $\delta = 0$ and $\alpha \neq 0$ (unit
root case) or $\phi < 1$ with $\delta \neq 0$ (trend stationary case). We can then test the joint
null hypothesis


$$H_0:\ \phi = 1 \ \text{and}\ \delta = 0$$

by a corresponding F-test. Note that the F-statistic, like the t-statistic, is not
distributed according to its standard distribution. If the test does not reject the null, we
conclude that $\{X_t\}$ is a unit root process with drift or equivalently a difference-stationary
(integrated) process. If the F-test rejects the null hypothesis, there are
three possible situations:

(i) The possibility $\phi < 1$ and $\delta = 0$ contradicts the primary observation that
$\{X_t\}$ has a trend and can therefore be eliminated.

(ii) The possibility $\phi = 1$ and $\delta \neq 0$ can also be excluded because this would
imply that $\{X_t\}$ has a quadratic trend, which is unrealistic.

(iii) The possibility $\phi < 1$ and $\delta \neq 0$ represents the only valid alternative. It
implies that $\{X_t\}$ is stationary around a linear trend, i.e. that $\{X_t\}$ is trend-stationary.

Similar conclusions can be reached if, instead of the F-test, a t-test is used to test
the null hypothesis $H_0: \phi = 1$ against the alternative $H_1: \phi < 1$. Thereby a
non-rejection of $H_0$ is interpreted to mean that $\delta = 0$. If, however, the null hypothesis
$H_0$ is rejected, this implies that $\delta \neq 0$, because $\{X_t\}$ exhibits a long-run trend.


10In case of the ADF-test additional regressors, $\Delta X_{t-j}$, $j > 0$, might be necessary.

The F-test is more powerful than the t-test. The t-test, however, is a one-sided
test,  which  has  the  advantage  that  it  actually  corresponds  to  the  primary  objective 
of  the  test.  In  Monte-Carlo  simulations  the  t-test  has  proven  to  be  marginally 
superior  to  the  F-test. 

$X_t$ has no long-run trend: In this case $\delta = 0$ and the Dickey-Fuller regression
should be run without a trend:11


$$X_t = \alpha + \phi X_{t-1} + Z_t.$$


Thus we have either $\phi = 1$ and $\alpha = 0$ or $\phi < 1$ and $\alpha \neq 0$. The null hypothesis
in this case therefore is


$$H_0:\ \phi = 1 \ \text{and}\ \alpha = 0.$$

A rejection of the null hypothesis can be interpreted in three alternative ways:

(i) The case $\phi < 1$ and $\alpha = 0$ can be eliminated because it implies that $\{X_t\}$
would have a mean of zero which is unrealistic for most economic time
series.

(ii) The case $\phi = 1$ and $\alpha \neq 0$ can equally be eliminated because it implies that
$\{X_t\}$ has a long-run trend which contradicts our primary assumption.

(iii) The case $\phi < 1$ and $\alpha \neq 0$ is the only realistic alternative. It implies that
the time series is stationary around a constant mean given by $\frac{\alpha}{1-\phi}$.

As before one can use, instead of an F-test, a t-test of the null hypothesis $H_0: \phi = 1$
against the alternative hypothesis $H_1: \phi < 1$. If the null hypothesis is not
rejected, we interpret this to imply that $\alpha = 0$. If, however, the null hypothesis
$H_0$ is rejected, we conclude that $\alpha \neq 0$. Similarly, Monte-Carlo simulations have
shown that the t-test is superior to the F-test.

The trend behavior of $X_t$ is uncertain: This situation poses the following problem.
Should the data exhibit a trend, but the Dickey-Fuller regression contains
no trend, then the test is biased in favor of the null hypothesis. If the data have
no trend, but the Dickey-Fuller regression contains a trend, the power of the test
is reduced. In such a situation one can adopt a two-stage strategy. Estimate the
Dickey-Fuller regression with a linear trend:


$$X_t = \alpha + \delta t + \phi X_{t-1} + Z_t.$$


Use the t-test to test the null hypothesis $H_0: \phi = 1$ against the alternative
hypothesis $H_1: \phi < 1$. If $H_0$ is not rejected, we conclude the process has a unit
root with or without drift. The presence of a drift can then be investigated by
a simple regression of $\Delta X_t$ against a constant followed by a simple t-test of the


11In case of the ADF-test additional regressors, $\Delta X_{t-j}$, $j > 0$, might be necessary.

null hypothesis that the constant is zero against the alternative hypothesis that the
constant is nonzero. As $\Delta X_t$ is stationary, the usual critical values can be used.12
If the t-test rejects the null hypothesis $H_0$, we conclude that there is no unit root.
The trend behavior can then be investigated by a simple t-test of the hypothesis
$H_0: \delta = 0$. In this test the usual critical values can be used as $\{X_t\}$ is already
viewed as being stationary.


7.3.4  Examples  of  Unit-Root  Tests 

As our first example, we examine the logged real GDP for Switzerland, $\ln(\mathrm{BIP}_t)$,
where we have adjusted the series for seasonality by taking a moving-average.
The corresponding data are plotted in Fig. 1.3. As is evident from this plot, this
variable exhibits a clear trend so that the Dickey-Fuller regression should include
a constant and a linear time trend. Moreover, $\{\Delta\ln(\mathrm{BIP}_t)\}$ is typically highly
autocorrelated which makes an autoregressive correction necessary. One way to
make this correction is by augmenting the Dickey-Fuller regression by lagged
$\{\Delta\ln(\mathrm{BIP}_t)\}$ as additional regressors. Thereby the number of lags is determined
by AIC. The corresponding result is reported in the first column of Table 7.2. It
shows that AIC chooses only one autoregressive correction term. The value of the
t-test statistic is -3.110 which is just above the 5 % critical value. Thus, the null
hypothesis is not rejected. If the autoregressive correction is chosen according to
the method proposed by Ng and Perron, five autoregressive lags have to be included.
With this specification, the value of the t-test statistic is clearly above the critical
value, implying that the null hypothesis of the presence of a unit root cannot
be rejected (see second column in Table 7.2).13 The results of the ADF-tests are
confirmed by the PP-test (column 3 in Table 7.2) with quadratic spectral kernel
function and bandwidth 20.3 chosen according to Andrews' formula (see Sect. 4.4).

The second example examines the three-month LIBOR, $\{\mathrm{R3M}_t\}$. The series is
plotted in Fig. 1.4. The issue whether this series has a linear trend or not is not easy
to decide. On the one hand, the series clearly has a negative trend over the sample
period considered. On the other hand, a negative time trend does not make sense
from an economic point of view because interest rates are bounded from below
by zero. Because of this uncertainty, it is advisable to include in the Dickey-Fuller
regression both a constant and a trend to be on the safe side. Column 4 in Table 7.2
reports the corresponding results. The value of the t-statistic of the PP-test with
Bartlett kernel function and bandwidth of 5 according to the Newey-West rule of
thumb is -2.142 and thus higher than the corresponding 5 % critical value of -3.435.


12If necessary, one must correct the corresponding standard deviation by taking the autocorrelation
in the residual into account. This can be done by using the long-run variance. In the literature this
correction is known as the Newey-West correction.

13The critical value changes slightly because the inclusion of additional autoregressive terms
changes the sample size.

Table 7.2 Examples of unit root tests

| | $\ln(\mathrm{BIP}_t)$ | $\ln(\mathrm{BIP}_t)$ | $\ln(\mathrm{BIP}_t)$ | $\mathrm{R3M}_t$ | $\mathrm{R3M}_t$ |
|---|---|---|---|---|---|
| Test | ADF | ADF | PP | PP | PP |
| Autoregressive correction | AIC | Ng and Perron | Quadratic spectral | Bartlett | Bartlett |
| Bandwidth | | | 20.3 | 5 | 5 |
| $\alpha$ | 0.337 | 0.275 | 0.121 | 0.595 | -0.014 |
| $\delta$ | 0.0001 | 0.0001 | 0.0002 | -0.0021 | |
| $\phi$ | 0.970 | 0.975 | 0.989 | 0.963 | -0.996 |
| $\gamma_1$ | 0.885 | 1.047 | | | |
| $\gamma_2$ | | -0.060 | | | |
| $\gamma_3$ | | -0.085 | | | |
| $\gamma_4$ | | -0.254 | | | |
| $\gamma_5$ | | 0.231 | | | |
| $t_{\hat{\phi}}$ | -3.110 | -2.243 | -1.543 | -2.142 | -0.568 |
| Critical value (5 %) | -3.460 | -3.463 | -3.460 | -3.435 | -2.878 |

Critical values from MacKinnon (1996)


Thus, we cannot reject the null hypothesis of the presence of a unit root. We
therefore conclude that the process $\{\mathrm{R3M}_t\}$ is integrated of order one, respectively
difference-stationary. Based on this conclusion, the issue of the trend can now be
decided by running a simple regression of $\Delta\mathrm{R3M}_t$ against a constant. This leads to
the following results:


$$\Delta\mathrm{R3M}_t = \underset{(0.0281)}{-0.0315} + e_t,$$

where $e_t$ denotes the least-squares residual. The mean of $\Delta\mathrm{R3M}_t$ is therefore
-0.0315. This value is, however, statistically not significantly different from zero as
indicated by the estimated standard error in parentheses. Note that this estimate of
the standard error has been corrected for autocorrelation (Newey-West correction).
Thus, $\{\mathrm{R3M}_t\}$ is not subject to a linear trend. One could have therefore run the
Dickey-Fuller regression without the trend term. The result corresponding to this
specification is reported in the last column of Table 7.2. It confirms the presence of
a unit root.
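The decision logic of the two-stage strategy from Sect. 7.3.3, applied to the numbers reported for the three-month LIBOR, can be traced with a few lines of code. This is only a sketch: the function `two_stage_decision` and its return labels are hypothetical, and the two-sided 5 % normal critical value 1.96 for the second stage is an assumption.

```python
def two_stage_decision(t_phi, crit_unit_root, second_stage_t, crit_normal=1.96):
    """Apply the two-stage rule for a series with uncertain trend behavior.

    t_phi           t-statistic for H0: phi = 1 from the regression with constant and trend
    crit_unit_root  left-tail critical value from the Dickey-Fuller tables
    second_stage_t  if H0 is kept: t-statistic of the constant in a regression of the
                    differences on a constant; if H0 is rejected: t-statistic of delta
    """
    unit_root = t_phi >= crit_unit_root           # one-sided test: reject only in the left tail
    significant = abs(second_stage_t) > crit_normal
    if unit_root:
        return "unit root with drift" if significant else "unit root without drift"
    return "trend-stationary" if significant else "stationary around a constant mean"

# Values taken from the text and Table 7.2 (PP-test with constant and trend)
drift_t = -0.0315 / 0.0281     # estimated mean of the differences over its standard error
decision = two_stage_decision(t_phi=-2.142, crit_unit_root=-3.435, second_stage_t=drift_t)
print(round(drift_t, 3), decision)
```

The t-ratio of the drift term is about -1.12 in absolute value well below 1.96, so the sketch arrives at the same conclusion as the text: a unit root without a linear trend.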


7.4  Generalizations  of  Unit-Root  Tests 
7.4.1  Structural  Breaks  in  the  Trend  Function 

As we have seen, the unit-root test depends heavily on the correct specification of
the deterministic part. Most of the time this amounts to deciding whether a linear
trend is present in the data or not. In the previous section we presented a rule for how
to proceed in case of uncertainty about the trend. Sometimes, however, the data
exhibit a structural break in their deterministic component. If this structural break is
ignored, the unit-root test is biased in favor of the null hypothesis (i.e. in favor of a
unit root) as demonstrated by Perron (1989). Unfortunately, the distribution of the
test statistic under the null hypothesis, in our case the t-statistic, depends on the exact
nature of the structural break and on its date of occurrence in the data. Following
Perron (1989) we concentrate on three exemplary cases: a level shift, a change in the
slope (change in the growth rate), and a combination of both possibilities. Figure 7.4
shows the three possibilities assuming that a break occurred in period $T_B$. Thereby
an AR(1) process with $\phi = 0.8$ was superimposed on the deterministic part.

Fig. 7.4 Three types of structural breaks at $T_B$. (a) Level shift, (b) Change in slope, (c) Level shift
and change in slope

The unit-root test with the possibility of a structural break in period $T_B$ is
carried out using the Dickey-Fuller test. Thereby the date of the structural break is
assumed to be known. This assumption, although restrictive, is justifiable in many
applications. The first oil price shock in 1973 or the German reunification in 1989
are examples of structural breaks which can be dated exactly. Other examples would
include changes in the way the data are constructed. These changes are usually
documented by the data collecting agencies. Table 7.3 summarizes the three variants
of Dickey-Fuller regression allowing for structural breaks.14

Table 7.3 Dickey-Fuller regression allowing for structural breaks

Model A: Level Shift

$H_0$: $X_t = \alpha + 1_{\{t = T_B + 1\}}\,\delta_B + X_{t-1} + Z_t$

$H_1$: $X_t = \alpha + \delta t + 1_{\{t > T_B\}}(\alpha_B - \alpha) + \phi X_{t-1} + Z_t$, $\phi < 1$

Model B: Change in Slope (Change in Growth Rate)

$H_0$: $X_t = \alpha + 1_{\{t > T_B\}}(\alpha_B - \alpha) + X_{t-1} + Z_t$

$H_1$: $X_t = \alpha + \delta t + 1_{\{t > T_B\}}(\delta_B - \delta)(t - T_B) + \phi X_{t-1} + Z_t$, $\phi < 1$

Model C: Level Shift and Change in Slope

$H_0$: $X_t = \alpha + 1_{\{t = T_B + 1\}}\,\delta_B + 1_{\{t > T_B\}}(\alpha_B - \alpha) + X_{t-1} + Z_t$

$H_1$: $X_t = \alpha + \delta t + 1_{\{t > T_B\}}(\alpha_B - \alpha) + 1_{\{t > T_B\}}(\delta_B - \delta)(t - T_B) + \phi X_{t-1} + Z_t$, $\phi < 1$

$1_{\{t = T_B + 1\}}$ and $1_{\{t > T_B\}}$ denote indicator functions which take the value one if the
condition is satisfied and the value zero otherwise

Model A allows only for a level shift. Under the null hypothesis the series
undergoes a one-time shift at time $T_B$. This level shift is maintained under the null
hypothesis which posits a random walk. Under the alternative, the process is viewed
as being trend-stationary whereby the trend line shifts parallel by $\alpha_B - \alpha$ at time
$T_B$. Model B considers a change in the mean growth rate from $\alpha$ to $\alpha_B$ at time $T_B$.
Under the alternative, the slope of the time trend changes from $\delta$ to $\delta_B$. Model C allows
for both types of break to occur at the same time.

The unit-root test with possible structural break for a time series $X_t$,
$t = 0, 1, \ldots, T$, is implemented in two stages as follows. In the first stage, we
regress $X_t$ on the corresponding deterministic component using OLS. The residuals
$\tilde{X}_0, \tilde{X}_1, \ldots, \tilde{X}_T$ from this regression are then used to carry out a Dickey-Fuller test:

$$\tilde{X}_t = \phi \tilde{X}_{t-1} + Z_t, \qquad t = 1, \ldots, T.$$

The distribution of the corresponding t-statistic under the null hypothesis depends
not only on the type of the structural break, but also on the relative date of the break
in the sample. Let this relative date be parameterized by $\lambda = T_B/T$. The asymptotic
distribution of the t-statistic has been tabulated by Perron (1989). This table can be
used to determine the critical values for the test. These critical values are smaller,
i.e. more negative, than those from the standard Dickey-Fuller table. Using a 5 %
significance level, the critical values range between -3.80 and -3.68 for model A,
between -3.96 and -3.65 for model B, and between -4.24 and -3.75 for model C,
depending on the value of $\lambda$. These values also show that the dependence on $\lambda$ is only weak.

In the practical application of the test one has to control for the autocorrelation in
the data. This can be done by using the Augmented Dickey-Fuller (ADF) test. This
amounts to the introduction of $\Delta \tilde{X}_{t-j}$, $j = 1, 2, \ldots, p - 1$, as additional regressors
in the above Dickey-Fuller regression. Thereby the order $p$ can be determined by
Akaike's information criterion (AIC) or by the iterative testing procedure of Ng and
Perron (1995). Alternatively, one may use, instead of the ADF test, the Phillips-Perron
test. In this case one computes the usual t-statistic for the null hypothesis
$\phi = 1$ and corrects it using the formulas in Phillips and Perron (1988) as explained
in Sect. 7.3.2. Which of the two methods is used is irrelevant for the determination
of the critical values, which can be extracted from Perron (1989).

14See Eq. (7.1) and Table 7.1 for comparison.

Although  it  may  be  legitimate  in  some  cases  to  assume  that  the  time  of  the 
structural  break  is  known,  we  cannot  take  this  for  granted.  It  is  therefore  important 
to  generalize  the  test  allowing  for  an  unknown  date  for  the  occurrence  of  a  structural 
break.  The  work  of  Zivot  and  Andrews  (1992)  has  shown  that  the  procedure 
proposed  by  Perron  can  be  easily  expanded  in  this  direction.  We  keep  the  three 
alternative  models  presented  in  Table  7.3,  but  change  the  null  hypothesis  to  a  random 
walk with drift with no exogenous structural break. Under the null hypothesis, $\{X_t\}$
is therefore assumed to be generated by

$$X_t = \alpha + X_{t-1} + Z_t, \qquad Z_t \sim \mathrm{WN}(0, \sigma^2).$$

The time of the structural break $T_B$, respectively $\lambda = T_B/T$, is estimated in such a way that
$\{X_t\}$ comes as close as possible to a trend-stationary process. Under the alternative
hypothesis $\{X_t\}$ is viewed as a trend-stationary process with unknown break point.
The goal of the estimation strategy is to choose $T_B$, respectively $\lambda$, in such a way
that the trend-stationary alternative receives the highest weight. Zivot and Andrews
(1992) propose to estimate $\lambda$ by minimizing the value of the t-statistic $t_{\hat\phi}(\lambda)$ under
the hypothesis $\phi = 1$:

$$t_{\hat\phi}(\hat\lambda_{\inf}) = \inf_{\lambda \in \Lambda} t_{\hat\phi}(\lambda) \qquad (7.2)$$

where $\Lambda$ is a closed subinterval of $(0, 1)$.15 The distribution of the test statistic under
the null hypothesis for the three cases is tabulated in Zivot and Andrews (1992). This
table allows one to determine the appropriate critical values for the test. In practice,
one has to take the autocorrelation of the time series into account by one of the
methods discussed previously.
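The two-stage procedure and the search over break dates can be sketched in a few lines of Python. This is only an illustration under stated assumptions: it covers Model A (level shift) only, it omits the lag augmentation needed in practice, the 15 % trimming of the sample is an arbitrary choice, and all function names are hypothetical. The resulting statistic has to be compared with the tabulated critical values of Perron (1989) for a known break date, or of Zivot and Andrews (1992) for the minimized statistic.

```python
import random


def solve(a, b):
    """Solve the square linear system a x = b by Gaussian elimination."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        # partial pivoting keeps the elimination numerically stable
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for k in range(col, n + 1):
                m[r][k] -= factor * m[col][k]
    x = [0.0] * n
    for r in reversed(range(n)):
        tail = sum(m[r][k] * x[k] for k in range(r + 1, n))
        x[r] = (m[r][n] - tail) / m[r][r]
    return x


def ols_residuals(y, rows):
    """Residuals from regressing y on the regressors stored row-wise in rows."""
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * v for r, v in zip(rows, y)) for i in range(k)]
    beta = solve(xtx, xty)
    return [v - sum(b * c for b, c in zip(beta, r)) for r, v in zip(rows, y)]


def df_t_stat(e):
    """t-statistic for phi = 1 in e_t = phi * e_{t-1} + z_t (no constant)."""
    lagged, current = e[:-1], e[1:]
    sxx = sum(v * v for v in lagged)
    phi = sum(a * b for a, b in zip(lagged, current)) / sxx
    ssr = sum((b - phi * a) ** 2 for a, b in zip(lagged, current))
    se = (ssr / (len(current) - 1) / sxx) ** 0.5
    return (phi - 1.0) / se


def break_t_stat(x, tb):
    """Stage 1: remove constant, trend and level shift; stage 2: DF statistic."""
    rows = [[1.0, float(t), 1.0 if t > tb else 0.0] for t in range(len(x))]
    return df_t_stat(ols_residuals(x, rows))


def inf_t_stat(x, trim=0.15):
    """Smallest statistic and its break date over the trimmed set of dates."""
    n = len(x)
    dates = range(int(trim * n), int((1.0 - trim) * n))
    return min((break_t_stat(x, tb), tb) for tb in dates)


def simulate_level_shift(n=200, tb=100, shift=5.0, phi=0.5, seed=0):
    """Trend-stationary series with AR(1) noise and a level shift after tb."""
    rng = random.Random(seed)
    noise, series = 0.0, []
    for t in range(n):
        noise = phi * noise + rng.gauss(0.0, 1.0)
        series.append(1.0 + 0.1 * t + (shift if t > tb else 0.0) + noise)
    return series
```

Calling `inf_t_stat(simulate_level_shift())` returns the minimized statistic together with the break date at which it is attained; `break_t_stat(x, tb)` alone corresponds to the case of a known break date.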

This  testing  strategy  can  be  adapted  to  determine  the  time  of  a  structural 
break  in  the  linear  trend  irrespective  of  whether  the  process  is  trend-stationary  or 
integrated  of  order  one.  The  distributions  of  the  corresponding  test  statistics  have 
been tabulated by Vogelsang (1997).16


15Taking the infimum over $\Lambda$ instead of over $(0, 1)$ is for theoretical reasons only. In practice, the
choice of $\Lambda$ plays no important role. For example, one may take $\Lambda = [0.01, 0.99]$.

16See also the survey by Perron (2006).

7.4.2 Testing for Stationarity (KPSS Test)

The  unit-root  tests  we  discussed  so  far  tested  the  null  hypothesis  that  the  process 
is  integrated  of  order  one  against  the  alternative  hypothesis  that  the  process 
is  integrated  of  order  zero  (i.e.  is  stationary).  However,  one  may  be  interested 
in  reversing  the  null  and  the  alternative  hypothesis  and  test  the  hypothesis  of 
stationarity  against  the  alternative  that  the  process  is  integrated  of  order  one.  Such 
a  test  has  been  proposed  by  Kwiatkowski  et  al.  (1992),  called  the  KPSS-Test.  This 
test  rests  on  the  idea  that  according  to  the  Beveridge-Nelson  decomposition  (see 
Sect.  7.1.4)  each  integrated  process  of  order  one  can  be  seen  as  the  sum  of  a  linear 
time trend, a random walk and a stationary process:

$$X_t = \alpha + \delta t + d \sum_{j=1}^{t} Z_j + U_t$$

where $\{U_t\}$ denotes a stationary process. If $d = 0$ then the process becomes
trend-stationary, otherwise it is integrated of order one.17 Thus, one can state the null and
the alternative hypothesis as follows:

$$H_0: d = 0 \quad \text{against} \quad H_1: d \neq 0.$$

Denote by $\{S_t\}$ the process of partial sums obtained from the residuals $\{e_t\}$ of
a regression of $X_t$ against a constant and a linear time trend, i.e. $S_t = \sum_{j=1}^{t} e_j$.18
Under the null hypothesis $d = 0$, $\{S_t\}$ is integrated of order one whereas
under the alternative $\{S_t\}$ is integrated of order two. Based on this consideration
Kwiatkowski et al. propose the following test statistic for a time series consisting of
$T$ observations:

KPSS test statistic:  $$W_T = \frac{\sum_{t=1}^{T} S_t^2}{T^2 \hat{J}_T} \qquad (7.3)$$

where $\hat{J}_T$ is an estimate of the long-run variance of $\{U_t\}$ (see Sect. 4.4). As $\{S_t\}$ is an
integrated process under the null hypothesis, the variance of $S_t$ grows linearly in $t$
(see Sect. 1.4.4 or 7.2) so that the sum of squared $S_t$ diverges at rate $T^2$. Thus, the test
statistic remains bounded and can be shown to converge. Note that the test statistic
is independent of further nuisance parameters. Under the alternative hypothesis,
however, $\{S_t\}$ is integrated of order two. Thus, the null hypothesis will be rejected for
large values of $W_T$. The corresponding asymptotic critical values of the test statistic
are reported in Table 7.4.
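The computation of the statistic (7.3) can be sketched as follows. The sketch assumes a Bartlett-kernel estimate of the long-run variance with a user-chosen bandwidth; this is one common choice, not necessarily the estimator of Sect. 4.4, and the function names and the default bandwidth are hypothetical.

```python
def detrend(x, trend=True):
    """Residuals from regressing x on a constant and, optionally, a time trend."""
    n = len(x)
    mean_x = sum(x) / n
    if not trend:
        return [v - mean_x for v in x]
    mean_t = (n - 1) / 2.0
    stt = sum((t - mean_t) ** 2 for t in range(n))
    slope = sum((t - mean_t) * (v - mean_x) for t, v in enumerate(x)) / stt
    return [v - mean_x - slope * (t - mean_t) for t, v in enumerate(x)]


def long_run_variance(e, bandwidth):
    """Bartlett-kernel estimate of the long-run variance of the residuals e."""
    n = len(e)
    lrv = sum(v * v for v in e) / n
    for h in range(1, bandwidth + 1):
        gamma = sum(e[t] * e[t - h] for t in range(h, n)) / n
        lrv += 2.0 * (1.0 - h / (bandwidth + 1.0)) * gamma
    return lrv


def kpss_stat(x, trend=True, bandwidth=4):
    """KPSS statistic: sum of squared partial sums over T^2 times the long-run variance."""
    e = detrend(x, trend)
    partial_sums, running = [], 0.0
    for v in e:
        running += v
        partial_sums.append(running)
    n = len(x)
    return sum(s * s for s in partial_sums) / (n ** 2 * long_run_variance(e, bandwidth))
```

The value returned by `kpss_stat` is compared with the critical values of Table 7.4, using the row that matches the `trend` setting; stationarity is rejected for large values.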


17If the data exhibit no trend, one can set $\delta$ equal to zero.

18This auxiliary regression may include additional exogenous variables.

Table 7.4 Critical values of the KPSS test

Regression without time trend
  Significance level    0.1      0.05     0.01
  Critical value        0.347    0.463    0.739

Regression with time trend
  Significance level    0.1      0.05     0.01
  Critical value        0.119    0.146    0.216

See Kwiatkowski et al. (1992)


7.5  Regression  with  Integrated  Variables 
7.5.1  The  Spurious  Regression  Problem 

The discussion on the Dickey-Fuller and Phillips-Perron tests showed that in
a regression of the integrated variable $X_t$ on its past $X_{t-1}$ the standard
$\sqrt{T}$-asymptotics no longer apply. A similar conclusion also holds if we regress an
integrated variable $X_t$ against another integrated variable $Y_t$. Suppose that both
processes $\{X_t\}$ and $\{Y_t\}$ are generated as a random walk:

$$X_t = X_{t-1} + U_t, \qquad U_t \sim \mathrm{IID}(0, \sigma_U^2)$$

$$Y_t = Y_{t-1} + V_t, \qquad V_t \sim \mathrm{IID}(0, \sigma_V^2)$$

where the processes $\{U_t\}$ and $\{V_t\}$ are uncorrelated with each other at all leads and
lags. Thus,

$$E(U_t V_s) = 0, \qquad \text{for all } t, s \in \mathbb{Z}.$$

Consider now the regression of $Y_t$ on $X_t$ and a constant:

$$Y_t = \alpha + \beta X_t + \varepsilon_t.$$

As $\{X_t\}$ and $\{Y_t\}$ are two random walks which are uncorrelated with each other by
construction, one would expect that the OLS-estimate of the coefficient of $X_t$, $\hat\beta$,
should tend to zero as the sample size $T$ goes to infinity. The same is expected for
the coefficient of determination $R^2$. This is, however, not true as has already been
remarked by Yule (1926) and, more recently, by Granger and Newbold (1974). The
above regression will have a tendency to "discover" a relationship between $Y_t$ and $X_t$
despite the fact that there is none. This phenomenon is called spurious correlation
or spurious regression. Similarly, unreliable results would be obtained by using
a simple t-test for the null hypothesis $\beta = 0$ against the alternative hypothesis
$\beta \neq 0$. The reason for these treacherous findings is that the model is incorrect
under the null as well as under the alternative hypothesis. Under the null hypothesis
$\{\varepsilon_t\}$ is an integrated process which violates the standard assumption for OLS. The
alternative hypothesis is not true by construction. Thus, OLS-estimates should be
interpreted with caution when a highly autocorrelated process $\{Y_t\}$ is regressed on
another highly autocorrelated process $\{X_t\}$. A detailed analysis of the spurious regression
problem is provided by Phillips (1986).

The spurious regression problem can be illustrated by a simple Monte Carlo
study. Specifying $U_t \sim \mathrm{IIDN}(0, 1)$ and $V_t \sim \mathrm{IIDN}(0, 1)$, we constructed $N = 1000$
samples for $\{Y_t\}$ and $\{X_t\}$ of size $T = 1000$ according to the specification above.
The sample size was chosen especially large to demonstrate that this is not a small
sample issue. As a contrast, we constructed two independent AR(1) processes with
AR-coefficients $\phi_X = 0.8$ and $\phi_Y = -0.5$.

Figures 7.5a,b show the drastic difference between a regression with stationary
variables and integrated variables. Whereas the distribution of the OLS-estimates
of $\beta$ is highly concentrated around the true value $\beta = 0$ in the stationary case, the
distribution is very flat in the case of integrated variables. A similar conclusion holds
for the corresponding t-value. The probability of obtaining a t-value greater than
1.96 is bigger than 0.9. This means that more than 90 % of the time the t-statistic
leads to a rejection of the null hypothesis and therefore suggests a relationship
between $Y_t$ and $X_t$ despite their independence. In the stationary case, this probability
turns out to be smaller than 0.05. These results are also reflected in the coefficient of
determination $R^2$. The median $R^2$ is approximately 0.17 in the case of the random
walks, but only 0.0002 in the case of AR(1) processes.
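A Monte Carlo experiment of this kind is easy to replicate. The following sketch is not the code behind Fig. 7.5: it uses smaller default values for the sample size and the number of replications to keep the run time short, it counts rejections based on the absolute t-value, and all function names are hypothetical.

```python
import random


def simulate(n, phi, rng):
    """AR(1) path x_t = phi * x_{t-1} + u_t with u_t ~ N(0, 1); phi = 1 gives a random walk."""
    x, path = 0.0, []
    for _ in range(n):
        x = phi * x + rng.gauss(0.0, 1.0)
        path.append(x)
    return path


def ols_slope_stats(y, x):
    """Return (slope, t-value, R^2) for the regression of y on x and a constant."""
    n = len(y)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    syy = sum((b - my) ** 2 for b in y)
    beta = sxy / sxx
    ssr = syy - beta * sxy          # residual sum of squares
    se = (ssr / (n - 2) / sxx) ** 0.5
    return beta, beta / se, 1.0 - ssr / syy


def rejection_rate(phi_x, phi_y, n=200, reps=200, seed=1):
    """Share of replications in which |t| > 1.96 although the series are independent."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(reps):
        x = simulate(n, phi_x, rng)
        y = simulate(n, phi_y, rng)
        _, t_value, _ = ols_slope_stats(y, x)
        hits += abs(t_value) > 1.96
    return hits / reps
```

Comparing `rejection_rate(1.0, 1.0)` (two random walks) with `rejection_rate(0.8, -0.5)` (two stationary AR(1) processes) reproduces the qualitative contrast described above; increasing `n` does not make the problem disappear in the random walk case.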

The problem remains the same if $\{X_t\}$ and $\{Y_t\}$ are specified as random walks
with drift:

$$X_t = \delta_X + X_{t-1} + U_t, \qquad U_t \sim \mathrm{IID}(0, \sigma_U^2)$$

$$Y_t = \delta_Y + Y_{t-1} + V_t, \qquad V_t \sim \mathrm{IID}(0, \sigma_V^2)$$

where $\{U_t\}$ and $\{V_t\}$ are again independent from each other at all leads and lags.
The regression would be the same as above:

$$Y_t = \alpha + \beta X_t + \varepsilon_t.$$


7.5.2  Bivariate  Cointegration 

The spurious regression problem cannot be circumvented by first testing for a unit
root in $Y_t$ and $X_t$ and then running the regression in first differences in case of no
rejection of the null hypothesis. The reason is that a regression in the levels of
$Y_t$ and $X_t$ may be sensible even when both variables are integrated. This is the case
when both variables are cointegrated. The concept of cointegration goes back to
Engle and Granger (1987) and initiated a veritable research boom. We will give a more
general definition in Chap. 16 when we deal with multivariate time series. Here we
stick to the case of two variables and present the following definition.

Definition 7.2 (Cointegration, Bivariate). Two stochastic processes $\{X_t\}$ and $\{Y_t\}$
are called cointegrated if the following two conditions are fulfilled:

(i) $\{X_t\}$ and $\{Y_t\}$ are both integrated processes of order one, i.e. $X_t \sim I(1)$ and
$Y_t \sim I(1)$;

(ii) there exists a constant $\beta \neq 0$ such that $\{Y_t - \beta X_t\}$ is a stationary process, i.e.
$\{Y_t - \beta X_t\} \sim I(0)$.

Fig. 7.5 Distribution of OLS-estimate $\hat\beta$ and t-statistic $t_{\hat\beta}$ for two independent random walks and
two independent AR(1) processes. (a) Distribution of $\hat\beta$. (b) Distribution of $t_{\hat\beta}$. (c) Distribution of
$\hat\beta$ and t-statistic $t_{\hat\beta}$

The issue whether two integrated processes are cointegrated can be decided on
the basis of a unit root test. Two cases can be distinguished. In the first one, $\beta$ is
assumed to be known. Thus, one can immediately apply the augmented Dickey-Fuller
(ADF) or the Phillips-Perron (PP) test to the process $\{Y_t - \beta X_t\}$. Thereby the
same issue regarding the specification of the deterministic part arises. The critical
values can be retrieved from the usual tables (for example from MacKinnon 1991).
In the second case, $\beta$ is not known and must be estimated from the data. This can be
done by running, as a first step, a simple (cointegrating) regression of $Y_t$ on $X_t$ including
a constant and/or a time trend.19 Thereby the specification of the deterministic part
follows the same rules as before. The unit root test is then applied, in the second
step, to the residuals from this regression. As the residuals have been obtained
from a preceding regression, we are faced with the so-called "generated regressor
problem".20 This implies that the usual Dickey-Fuller tables can no longer be used;
instead the tables provided by Phillips and Ouliaris (1990) become the relevant ones.
As before, the corresponding asymptotic distribution depends on the specification
of the deterministic part in the cointegrating regression. If this regression included
a constant, the residuals necessarily have a mean of zero so that the Dickey-Fuller
regression should include no constant (case 1 in Table 7.1):

$$e_t = \phi e_{t-1} + \xi_t$$

where $e_t$ and $\xi_t$ denote the residuals from the cointegrating and the residuals of the
Dickey-Fuller regression, respectively. In most applications it is necessary to correct
for autocorrelation which can be done by including additional lagged differences
$\Delta e_{t-1}, \ldots, \Delta e_{t-p+1}$ as in the ADF-test or by adjusting the t-statistic as in the PP-test.
The test where $\beta$ is estimated from a regression is called the regression test for
cointegration. Note that if the two series are cointegrated then the OLS estimate of
$\beta$ is (super) consistent.
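The two steps of the regression test for cointegration can be sketched as follows. The sketch assumes a cointegrating regression with a constant only and omits the correction for autocorrelation; the simulated data and all function names are hypothetical. The statistic has to be compared with the Phillips and Ouliaris (1990) critical values, not with the usual Dickey-Fuller tables.

```python
import random


def cointegrating_residuals(y, x):
    """Step 1: OLS regression of y on x and a constant; returns (slope, residuals)."""
    n = len(y)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    beta = sum((a - mx) * (b - my) for a, b in zip(x, y)) / sxx
    alpha = my - beta * mx
    return beta, [b - alpha - beta * a for a, b in zip(x, y)]


def residual_df_stat(e):
    """Step 2: t-statistic for phi = 1 in e_t = phi * e_{t-1} + xi_t (no constant)."""
    lagged, current = e[:-1], e[1:]
    sxx = sum(v * v for v in lagged)
    phi = sum(a * b for a, b in zip(lagged, current)) / sxx
    ssr = sum((b - phi * a) ** 2 for a, b in zip(lagged, current))
    se = (ssr / (len(current) - 1) / sxx) ** 0.5
    return (phi - 1.0) / se


def simulate_pair(n=300, beta=0.5, seed=0):
    """X is a random walk and Y = beta * X + white noise, so the pair is cointegrated."""
    rng = random.Random(seed)
    level, x, y = 0.0, [], []
    for _ in range(n):
        level += rng.gauss(0.0, 1.0)
        x.append(level)
        y.append(beta * level + rng.gauss(0.0, 1.0))
    return y, x
```

For the simulated pair, `residual_df_stat(cointegrating_residuals(y, x)[1])` is strongly negative, so that the null hypothesis of no cointegration is rejected, and the estimated slope lies close to the value of `beta` used in the simulation.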

In principle it is possible to generalize this single equation approach to more
than two variables. This encounters, however, some conceptual problems. First,
there is the possibility of more than one linearly independent cointegrating
relationship which cannot be detected by a single regression. Second, the dependent
variable in the regression may not be part of the cointegrating relation which might
involve only the other variables. In such a situation the cointegrating regression is
again subject to the spurious regression problem. These issues turned the interest
of the profession towards multivariate approaches. Chapter 16 presents alternative
procedures and discusses the testing, estimation, and interpretation of cointegrating
relationships in detail.


19Thereby,  in  contrast  to  ordinary  OLS  regressions,  it  is  irrelevant  which  variable  is  treated  as  the 
left  hand,  respectively  right  hand  variable. 

20This  problem  was  first  analyzed  by  Nicholls  and  Pagan  (1984)  and  Pagan  (1984)  in  a  stationary 
context.

An Example for Bivariate Cointegration

As an example, we consider the relation between the short-term interest rate,
$\{\mathrm{R3M}_t\}$, and inflation, $\{\mathrm{INFL}_t\}$, in Switzerland over the period January 1989 to
February 2012. As the short-term interest rate we take the three month LIBOR.
Both time series are plotted in Fig. 7.6a. As they are integrated according to the
unit root tests (not shown here), we can look for cointegration. The cointegrating
regression delivers:

$$\mathrm{INFL}_t = -0.0088 + 0.5535\, \mathrm{R3M}_t + e_t, \qquad R^2 = 0.7798. \qquad (7.4)$$

The residuals from this regression, denoted by $e_t$, are represented in Fig. 7.6b. The
ADF unit root test of these residuals leads to a value of -3.617 for the t-statistic.
Thereby an autoregressive correction of 13 lags was necessary according to the AIC
criterion. The corresponding value of the t-statistic resulting from the PP unit root
test using a Bartlett window with band width 7 is -4.294. Taking a significance
level of 5 %, the critical value according to Phillips and Ouliaris (1990, Table IIb) is
-3.365.21 Thus, the ADF as well as the PP test reject the null hypothesis of a unit
root in the residuals. This implies that inflation and the short-term interest rate are
cointegrated.


7.5.3 Rules to Deal with Integrated Time Series

The previous sections demonstrated that the handling of integrated variables has
to be done with care. We will therefore in this section examine some rules of
thumb which should serve as a guideline in practical empirical work. These rules
are summarized in Table 7.5. In doing so, this section follows very closely the paper by
Stock and Watson (1988b) (see also Campbell and Perron 1991).22 Consider the
linear regression model:

$$Y_t = \beta_0 + \beta_1 X_{1,t} + \ldots + \beta_K X_{K,t} + \varepsilon_t. \qquad (7.5)$$


This  model  is  usually  based  on  two  assumptions: 

(1) The disturbance term $\varepsilon_t$ is white noise and is uncorrelated with any regressor.
This  is,  for  example,  the  case  if  the  regressors  are  deterministic  or  exogenous. 

(2)  All  regressors  are  either  deterministic  or  stationary  processes. 

If  Eq.  (7.5)  represents  the  true  data  generating  process,  {Yt}  must  be  a  stationary 
process.  Under  the  above  assumptions,  the  OLS-estimator  is  consistent  and  the 
OLS-estimates  are  asymptotically  normally  distributed  so  that  the  corresponding 
t-  and  F-statistics  will  be  approximately  distributed  as  t-  and  F-  distributions. 


21For comparison, the corresponding critical value according to MacKinnon (1991) is -2.872.
22For a thorough analysis the interested reader is referred to Sims et al. (1990).

Fig. 7.6 Cointegration of inflation and three-month LIBOR. (a) Inflation and three-month
LIBOR. (b) Residuals from cointegrating regression

Consider now the case that assumption 2 is violated and that some or all
regressors  are  integrated,  but  that  instead  one  of  the  two  following  assumptions 
holds: 

(2.a) The relevant coefficients are coefficients of mean-zero stationary variables.
(2.b) Although the relevant coefficients are those of integrated variables, the
regression can be rearranged in such a way that the relevant coefficients
become coefficients of mean-zero stationary variables.

Under assumptions 1 and 2.a or 2.b the OLS-estimator remains consistent. Also the
corresponding t- and F-statistics remain valid so that the appropriate critical values
can be retrieved from the t-, respectively F-distribution. Suppose now that neither
assumption 2.a nor 2.b holds, but that the following assumption does:

(2.c) The relevant coefficients are coefficients of integrated variables and the
regression cannot be rewritten in a way that they become coefficients of
stationary variables.

If assumption 1 remains valid, but assumption 2.c holds instead of 2.a and 2.b, the
OLS-estimator is still consistent. However, the standard asymptotic theory for the t-
and the F-statistic fails so that they become useless for normal statistical inferences.

If we simply regress one variable on another in levels, the error term $\varepsilon_t$ is likely
not to follow a white noise process. In addition, it may even be correlated with some
regressors. Suppose that we replace assumption 1 by:

(1.a) The integrated dependent variable is cointegrated with at least one integrated
regressor such that the error term is stationary, but may remain autocorrelated
or correlated with the regressors.

Under assumptions 1.a and 2.a, respectively 2.b, the regressors are stationary, but
correlated with the disturbance term. In this case the OLS-estimator becomes
inconsistent. This situation is known as the classic omitted variable bias, simultaneous
equation bias or errors-in-variable bias. Under assumptions 1.a and 2.c, by contrast, the
OLS-estimator is consistent for the coefficients of interest, although the standard
asymptotic theory fails. Finally, if both the dependent variable and the regressors are
integrated without being cointegrated, then the disturbance term is integrated and
the OLS-estimator becomes inconsistent. This is the spurious regression problem
treated in Sect. 7.5.1.

Table 7.5 Rules of thumb in regressions with integrated processes

Assumptions                        OLS-estimator                          Remarks
                                   Consistency   Standard asymptotics
(1)     (2)                        Yes           Yes                      Classic results for OLS
(1)     (2.a)                      Yes           Yes
(1)     (2.b)                      Yes           Yes
(1)     (2.c)                      Yes           No
(1.a)   (2.a)                      No            No                       Omitted variable bias
(1.a)   (2.b)                      No            No                       Omitted variable bias
(1.a)   (2.c)                      Yes           No
Neither (1) nor (1.a)   (2.c)      No            No                       Spurious regression

Source: Stock and Watson (1988b); results for the coefficients of interest

Example: Term Structure of Interest

We illustrate the above rules of thumb by investigating again the relation between
inflation ($\{\mathrm{INFL}_t\}$) and the short-term interest rate ($\{\mathrm{R3M}_t\}$). In Sect. 7.5.2 we found
that the two variables are cointegrated with coefficient $\hat\beta = 0.5535$ (see Eq. (7.4)).
In a further step we want to investigate a dynamic relation between the two variables
and estimate the following equation:

$$\mathrm{R3M}_t = c + \phi_1 \mathrm{R3M}_{t-1} + \phi_2 \mathrm{R3M}_{t-2} + \phi_3 \mathrm{R3M}_{t-3} + \delta_1 \mathrm{INFL}_{t-1} + \delta_2 \mathrm{INFL}_{t-2} + \varepsilon_t$$

where $\varepsilon_t \sim \mathrm{WN}(0, \sigma^2)$. In this regression we want to test the hypotheses $\phi_3 = 0$
against $\phi_3 \neq 0$ and $\delta_1 = 0$ against $\delta_1 \neq 0$ by examining the corresponding simple
t-statistics. Note that we are in the context of integrated variables so that the rules of
thumb summarized in Table 7.5 apply. We can rearrange the above equation as


$$\Delta \mathrm{R3M}_t = c + (\phi_1 + \phi_2 + \phi_3 - 1)\, \mathrm{R3M}_{t-1} - (\phi_2 + \phi_3)\, \Delta \mathrm{R3M}_{t-1} - \phi_3\, \Delta \mathrm{R3M}_{t-2} + \delta_1 \mathrm{INFL}_{t-1} + \delta_2 \mathrm{INFL}_{t-2} + \varepsilon_t.$$

$\phi_3$ is now a coefficient of a stationary variable in a regression with a stationary
dependent variable. In addition $\varepsilon_t \sim \mathrm{WN}(0, \sigma^2)$ so that assumptions (1) and (2.b)
are satisfied. We can therefore use the ordinary t-statistic to test the hypothesis $\phi_3 = 0$
against $\phi_3 \neq 0$. Note that it is not necessary to actually carry out the rearrangement
of the equation. All relevant items can be retrieved from the original equation.
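For readers who want to check the rearrangement, the following short derivation spells out the substitution. It only uses the identity $\mathrm{R3M}_{t-j-1} = \mathrm{R3M}_{t-j} - \Delta \mathrm{R3M}_{t-j}$; the abbreviation $R_t$ for $\mathrm{R3M}_t$ is introduced here for brevity.

```latex
\begin{align*}
R_{t-2} &= R_{t-1} - \Delta R_{t-1}, \\
R_{t-3} &= R_{t-2} - \Delta R_{t-2} = R_{t-1} - \Delta R_{t-1} - \Delta R_{t-2}, \\
\phi_1 R_{t-1} + \phi_2 R_{t-2} + \phi_3 R_{t-3}
  &= (\phi_1 + \phi_2 + \phi_3) R_{t-1}
     - (\phi_2 + \phi_3) \Delta R_{t-1}
     - \phi_3 \Delta R_{t-2}.
\end{align*}
```

Subtracting $R_{t-1}$ from both sides of the original equation then gives the coefficient $\phi_1 + \phi_2 + \phi_3 - 1$ on $R_{t-1}$, while the terms in $\mathrm{INFL}_{t-1}$ and $\mathrm{INFL}_{t-2}$ are unaffected.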

To test the hypothesis $\delta_1 = 0$, we rearrange the equation to yield:

$$\Delta \mathrm{R3M}_t = c + (\phi_1 + \delta_1 \hat\beta - 1)\, \mathrm{R3M}_{t-1} + \phi_2 \mathrm{R3M}_{t-2} + \phi_3 \mathrm{R3M}_{t-3} + \delta_1 (\mathrm{INFL}_{t-1} - \hat\beta\, \mathrm{R3M}_{t-1}) + \delta_2 \mathrm{INFL}_{t-2} + \varepsilon_t.$$

As $\{\mathrm{R3M}_t\}$ and $\{\mathrm{INFL}_t\}$ are cointegrated, $\mathrm{INFL}_{t-1} - \hat\beta\, \mathrm{R3M}_{t-1}$ is stationary. Thus
assumptions (1) and (2.b) hold again and we use once more the simple t-test. As
before it is not necessary to actually carry out the transformation.

The prices of financial market securities are often shaken by large and time-varying
shocks.  The  amplitudes  of  these  price  movements  are  not  constant.  There  are  periods 
of  high  volatility  and  periods  of  low  volatility.  Within  these  periods  volatility 
seems  to  be  positively  autocorrelated:  high  amplitudes  are  likely  to  be  followed 
by  high  amplitudes  and  low  amplitudes  by  low  amplitudes.  This  observation  which 
is  particularly  relevant  for  high  frequency  data  such  as,  for  example,  daily  stock 
market  returns  implies  that  the  conditional  variance  of  the  one-period  forecast  error 
is  no  longer  constant  (homoskedastic),  but  time-varying  (heteroskedastic).  This 
insight  motivated  Engle  (1982)  and  Bollerslev  (1986)  to  model  the  time-varying 
variance  thereby  triggering  a  huge  and  still  growing  literature.1  The  importance  of 
volatility  models  stems  from  the  fact  that  the  price  of  an  option  crucially  depends 
on  the  variance  of  the  underlying  security  price.  Thus  with  the  surge  of  derivative 
markets  in  the  last  decades  the  application  of  such  models  has  seen  a  tremendous 
rise.  Another  use  of  volatility  models  is  to  assess  the  risk  of  an  investment.  In  the 
computation  of  the  so-called  value  at  risk  (VaR),  these  models  have  become  an 
indispensable  tool.  In  the  banking  industry,  due  to  the  regulations  of  the  Basel 
accords,  such  assessments  are  in  particular  relevant  for  the  computation  of  the 
required  equity  capital  backing-up  assets  of  different  risk  categories. 

The following exposition focuses on the class of autoregressive conditional
heteroskedasticity models (ARCH models) and their generalization, the generalized
autoregressive conditional heteroskedasticity models (GARCH models). These
models form the basis for even more generalized models (see Bollerslev et al. (1994)
or Gourieroux (1997)). Campbell et al. (1997) provide a broader economically
motivated approach to the econometric analysis of financial market data.

1Robert F. Engle III was awarded the Nobel prize in 2003 for his work on time-varying volatility.
His Nobel lecture (Engle 2004) is a nice and readable introduction to this literature.


8.1  Specification  and  Interpretation 

8.1.1  Forecasting  Properties  of  AR(1)-Models 

Models  of  volatility  play  an  important  role  in  explaining  the  behavior  of  financial 
market  data.  They  start  from  the  observation  that  periods  of  high  (low)  volatility  are 
clustered  in  specific  time  intervals.  In  these  intervals  high  (low)  volatility  periods 
are  typically  followed  by  high  (low)  volatility  periods.  Thus  volatility  is  usually 
positively  autocorrelated  as  can  be  observed  in  Fig.  8.3.  In  order  to  understand 
this  phenomenon  we  recapitulate  the  forecasting  properties  of  the  AR(1)  model.2 
Starting  from  the  model 

$$X_t = c + \phi X_{t-1} + Z_t, \qquad Z_t \sim \mathrm{IID}(0, \sigma^2) \text{ and } |\phi| < 1,$$

the best linear forecast in the mean-squared-error sense of $X_{t+1}$ conditional on
$\{X_t, X_{t-1}, \ldots\}$, denoted by $P_t X_{t+1}$, is given by (see Chap. 3)

$$P_t X_{t+1} = c + \phi X_t.$$

In practice the parameters $c$ and $\phi$ are replaced by an estimate.

The conditional variance of the forecast error then becomes:

$$E_t (X_{t+1} - P_t X_{t+1})^2 = E_t Z_{t+1}^2 = \sigma^2,$$

where $E_t$ denotes the conditional expectation operator based on information
$X_t, X_{t-1}, \ldots$ The conditional variance of the forecast error is therefore constant,
irrespective of the current state.

The unconditional forecast is simply the expected value $E X_{t+1} = \mu = \frac{c}{1 - \phi}$
with forecast error variance:

$$E\left(X_{t+1} - \frac{c}{1 - \phi}\right)^2 = E\left(Z_{t+1} + \phi Z_t + \phi^2 Z_{t-1} + \ldots\right)^2 = \frac{\sigma^2}{1 - \phi^2} > \sigma^2.$$

Thus the conditional as well as the unconditional variance of the forecast error are
constant. In addition, the conditional variance is smaller, i.e. the conditional forecast
is more precise, because it uses more information. Similar arguments can be made for ARMA
models in general.
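A small numerical check makes the comparison of the two forecast error variances concrete. The sketch below simply evaluates the formulas of this section; the function name and the truncation length of the MA($\infty$) sum are hypothetical choices.

```python
def forecast_error_variances(phi, sigma2, terms=200):
    """Return (conditional, unconditional, truncated MA sum) forecast error variances."""
    if abs(phi) >= 1:
        raise ValueError("the formulas require |phi| < 1")
    conditional = sigma2                       # variance of Z_{t+1}
    unconditional = sigma2 / (1.0 - phi ** 2)  # closed form of the MA(infinity) sum
    # truncated version of sigma^2 * (1 + phi^2 + phi^4 + ...)
    truncated = sigma2 * sum(phi ** (2 * j) for j in range(terms))
    return conditional, unconditional, truncated
```

With `phi = 0.8` and `sigma2 = 1.0` the unconditional variance equals $1/0.36$, which is almost three times the conditional one; the gap widens as $|\phi|$ approaches one.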


2Instead of assuming $Z_t \sim \mathrm{WN}(0, \sigma^2)$, we make for convenience the stronger assumption that
$Z_t \sim \mathrm{IID}(0, \sigma^2)$.

8.1.2 The ARCH(1) Model

The volatility of financial market prices exhibits a systematic behavior so that the
conditional forecast error variance is no longer constant. This observation led Engle
(1982) to consider the following simple model for heteroskedasticity (non-constant
variance).

Definition 8.1 (ARCH(1) Model). A stochastic process $\{Z_t\}$, $t \in \mathbb{Z}$, is called an
autoregressive conditional heteroskedastic process of order one, ARCH(1) process,
if it is the solution of the following stochastic difference equation:

$$Z_t = \nu_t \sqrt{\alpha_0 + \alpha_1 Z_{t-1}^2} \qquad \text{with } \alpha_0 > 0 \text{ and } 0 < \alpha_1 < 1, \qquad (8.1)$$

where $\nu_t \sim \mathrm{IIDN}(0, 1)$ and where $\nu_t$ and $Z_{t-1}$ are independent from each other for
all $t \in \mathbb{Z}$.

We  will  discuss  the  implications  of  this  simple  model  below  and  consider 
generalizations  in  the  next  sections.  First  we  prove  the  following  theorem. 

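Theorem 8.1. Under the conditions stated in the definition of the ARCH(1) process, the difference equation (8.1) possesses a unique and strictly stationary solution with $\mathbb{E} Z_t^2 < \infty$. This solution is given by

$$Z_t = v_t \sqrt{\alpha_0 \left(1 + \sum_{j=1}^{\infty} \alpha_1^j v_{t-1}^2 v_{t-2}^2 \cdots v_{t-j}^2\right)}. \tag{8.2}$$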


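Proof. Define the process

$$Y_t = Z_t^2 = v_t^2 (\alpha_0 + \alpha_1 Y_{t-1}). \tag{8.3}$$

Iterating backwards $k$ times we get:

$$\begin{aligned}
Y_t &= \alpha_0 v_t^2 + \alpha_1 v_t^2 Y_{t-1} = \alpha_0 v_t^2 + \alpha_1 v_t^2 v_{t-1}^2 (\alpha_0 + \alpha_1 Y_{t-2}) \\
&= \alpha_0 v_t^2 + \alpha_0 \alpha_1 v_t^2 v_{t-1}^2 + \alpha_1^2 v_t^2 v_{t-1}^2 Y_{t-2} \\
&= \ldots = \alpha_0 v_t^2 + \alpha_0 \sum_{j=1}^{k} \alpha_1^j v_t^2 v_{t-1}^2 \cdots v_{t-j}^2 + \alpha_1^{k+1} v_t^2 v_{t-1}^2 \cdots v_{t-k}^2 Y_{t-k-1}.
\end{aligned}$$

Define the process $\{Y'_t\}$ as

$$Y'_t = \alpha_0 v_t^2 + \alpha_0 \sum_{j=1}^{\infty} \alpha_1^j v_t^2 v_{t-1}^2 \cdots v_{t-j}^2.$$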

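The right-hand side of the above expression contains only nonnegative terms. Moreover, making use of the IIDN(0, 1) assumption on $\{v_t\}$,

$$\mathbb{E} Y'_t = \mathbb{E}(\alpha_0 v_t^2) + \alpha_0 \mathbb{E}\left(\sum_{j=1}^{\infty} \alpha_1^j v_t^2 v_{t-1}^2 \cdots v_{t-j}^2\right) = \alpha_0 \sum_{j=0}^{\infty} \alpha_1^j = \frac{\alpha_0}{1-\alpha_1}.$$

Thus, $0 \le Y'_t < \infty$ a.s. Therefore, $\{Y'_t\}$ is strictly stationary and satisfies the difference equation (8.3). This implies that $Z_t = v_t \sqrt{\alpha_0 + \alpha_1 Y'_{t-1}}$, for which $Z_t^2 = Y'_t$, is also strictly stationary and satisfies the difference equation (8.1).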

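To prove uniqueness, we follow Giraitis et al. (2000). For any fixed $t$, it follows from the definitions of $Y_t$ and $Y'_t$ that for any $k \ge 1$

$$|Y_t - Y'_t| \le \alpha_1^{k+1} v_t^2 v_{t-1}^2 \cdots v_{t-k}^2 |Y_{t-k-1}| + \alpha_0 \sum_{j=k+1}^{\infty} \alpha_1^j v_t^2 v_{t-1}^2 \cdots v_{t-j}^2.$$

The expectation of the right-hand side is bounded by

$$\left(\mathbb{E}|Y_1| + \frac{\alpha_0}{1-\alpha_1}\right) \alpha_1^{k+1}.$$

Define the event $A_k$ by $A_k = \{|Y_t - Y'_t| > 1/k\}$. Then,

$$\mathbb{P}(A_k) \le k\, \mathbb{E}|Y_t - Y'_t| \le k \left(\mathbb{E}|Y_1| + \frac{\alpha_0}{1-\alpha_1}\right) \alpha_1^{k+1},$$

where the first inequality follows from Chebyshev's inequality setting $r = 1$ (see Theorem C.3). Thus, $\sum_{k=1}^{\infty} \mathbb{P}(A_k) < \infty$. The Borel-Cantelli lemma (see Theorem C.4) then implies that $\mathbb{P}(A_k \text{ i.o.}) = 0$. However, as $A_k \subset A_{k+1}$, $\mathbb{P}(A_k) = 0$ for any $k$. Thus, $Y_t = Y'_t$ a.s. □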

Remark 8.1. Note that the normality assumption is not necessary for the proof. The assumption $v_t \sim \mathrm{IID}(0, 1)$ would be sufficient. Indeed, in practice it has proven useful to adopt distributions with fatter tails than the normal, like the t-distribution (see the discussion in Sect. 8.1.3).

Given the assumptions made above, $\{Z_t\}$ has the following properties:


(i) The expected value of $Z_t$ is:

$$\mathbb{E} Z_t = \mathbb{E} v_t \, \mathbb{E} \sqrt{\alpha_0 + \alpha_1 Z_{t-1}^2} = 0.$$

This follows from the assumption that $v_t$ and $Z_{t-1}$ are independent.

(ii) The covariances between $Z_t$ and $Z_{t-h}$, $\mathbb{E} Z_t Z_{t-h}$, for $h \neq 0$ are given by:

$$\mathbb{E} Z_t Z_{t-h} = 0.$$

This is also a consequence of the independence assumption between $v_t$ and $Z_{t-1}$, respectively between $v_{t-h}$ and $Z_{t-h-1}$.

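(iii) The variance of $Z_t$ is:

$$\mathbb{V} Z_t = \mathbb{E} Z_t^2 = \mathbb{E}\, v_t^2 \left(\alpha_0 + \alpha_1 Z_{t-1}^2\right) = \mathbb{E} v_t^2 \, \mathbb{E}\left(\alpha_0 + \alpha_1 Z_{t-1}^2\right) = \frac{\alpha_0}{1-\alpha_1} < \infty.$$

This follows from the independence assumption between $v_t$ and $Z_{t-1}$ and from the stationarity of $\{Z_t\}$. Because $\alpha_0 > 0$ and $0 < \alpha_1 < 1$, the variance is always strictly positive and finite.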

(iv) As $v_t$ is normally distributed, its skewness, $\mathbb{E} v_t^3$, equals zero. The independence assumption between $v_t$ and $Z_{t-1}^2$ then implies that the skewness of $Z_t$ is also zero, i.e.

$$\mathbb{E} Z_t^3 = 0.$$

$Z_t$ therefore has a symmetric distribution.
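Properties (i) to (iii) can be checked by simulation. The sketch below is a minimal illustration under assumed parameter values ($\alpha_0 = 1$, $\alpha_1 = 0.3$, sample size 50,000, a fixed seed); the function names are hypothetical and the burn-in length is an arbitrary choice.

```python
import math
import random


def simulate_arch1(n, alpha0, alpha1, seed=0, burn_in=500):
    """Simulate Z_t = v_t * sqrt(alpha0 + alpha1 * Z_{t-1}^2) with v_t ~ N(0, 1)."""
    rng = random.Random(seed)
    z_prev = 0.0  # arbitrary start value; the burn-in removes its influence
    path = []
    for t in range(n + burn_in):
        z_prev = rng.gauss(0.0, 1.0) * math.sqrt(alpha0 + alpha1 * z_prev ** 2)
        if t >= burn_in:
            path.append(z_prev)
    return path


def sample_moments(z):
    """Return the sample mean, variance and first-order autocorrelation."""
    n = len(z)
    mean = sum(z) / n
    var = sum((x - mean) ** 2 for x in z) / n
    cov1 = sum((z[t] - mean) * (z[t - 1] - mean) for t in range(1, n)) / n
    return mean, var, cov1 / var


z = simulate_arch1(50_000, alpha0=1.0, alpha1=0.3, seed=42)
mean, var, rho1 = sample_moments(z)
theoretical_var = 1.0 / (1.0 - 0.3)  # alpha0 / (1 - alpha1)
print(mean, var, theoretical_var, rho1)
```

The sample mean and the first-order autocorrelation should be close to zero and the sample variance close to $\alpha_0/(1-\alpha_1)$, up to simulation error.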

The properties (i), (ii) and (iii) show that $\{Z_t\}$ is a white noise process. According to Theorem 8.1 it is not only stationary but even strictly stationary. Thus $Z_t$ is uncorrelated with $Z_{t-1}, Z_{t-2}, \ldots$, but not independent of its past! In particular we have:

$$\mathbb{V}(Z_t \mid Z_{t-1}, Z_{t-2}, \ldots) = \mathbb{E}\left(Z_t^2 \mid Z_{t-1}, Z_{t-2}, \ldots\right) = \mathbb{E}_t v_t^2 \left(\alpha_0 + \alpha_1 Z_{t-1}^2\right) = \alpha_0 + \alpha_1 Z_{t-1}^2.$$

The conditional variance of $Z_t$ therefore depends on $Z_{t-1}$. Note that this dependence is positive because $\alpha_1 > 0$.

In order to guarantee that this conditional variance is always positive, we must postulate that $\alpha_0 > 0$ and $\alpha_1 > 0$. The stability of the difference equation requires in addition that $\alpha_1 < 1$.³ Thus high volatility in the past, i.e. a large realization of $Z_{t-1}$ in absolute value, is followed by high volatility in the future. The precision of the forecast, measured by the conditional variance of the forecast error, thus depends on the history of the process. This feature is not compatible with linear models and thus underlines the non-linear character of the ARCH model and its generalizations.

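Despite the fact that $v_t$ was assumed to be normally distributed, $Z_t$ is not normally distributed. Its distribution deviates from the normal distribution in that extreme realizations are more probable. This property is called the heavy-tail property. In particular we have⁴:

$$\begin{aligned}
\mathbb{E} Z_t^4 &= \mathbb{E}\, v_t^4 \left(\alpha_0 + \alpha_1 Z_{t-1}^2\right)^2 = \mathbb{E}\, v_t^4 \left(\alpha_0^2 + 2\alpha_0\alpha_1 Z_{t-1}^2 + \alpha_1^2 Z_{t-1}^4\right) \\
&= 3\alpha_0^2 + \frac{6\alpha_0^2\alpha_1}{1-\alpha_1} + 3\alpha_1^2\, \mathbb{E} Z_{t-1}^4.
\end{aligned}$$

The strict stationarity of $\{Z_t\}$ implies $\mathbb{E} Z_t^4 = \mathbb{E} Z_{t-1}^4$ so that

$$\left(1 - 3\alpha_1^2\right) \mathbb{E} Z_t^4 = \frac{3\alpha_0^2 (1+\alpha_1)}{1-\alpha_1} \quad \Longrightarrow \quad \mathbb{E} Z_t^4 = \frac{1}{1-3\alpha_1^2} \times \frac{3\alpha_0^2 (1+\alpha_1)}{1-\alpha_1}.$$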

$\mathbb{E} Z_t^4$ is therefore positive and finite if and only if $3\alpha_1^2 < 1$, i.e. if $0 < \alpha_1 < 1/\sqrt{3} = 0.5774$. For high correlation of the conditional variance, i.e. $\alpha_1 > 1/\sqrt{3}$, the fourth moment and therefore also all higher even moments no longer exist. The kurtosis $\kappa$ is

$$\kappa = \frac{\mathbb{E} Z_t^4}{\left[\mathbb{E} Z_t^2\right]^2} = 3 \times \frac{1-\alpha_1^2}{1-3\alpha_1^2} > 3,$$

if $\mathbb{E} Z_t^4$ exists. The heavy-tail property manifests itself in a kurtosis greater than 3, which is the kurtosis of the normal distribution. The distribution of $Z_t$ is therefore leptokurtic and thus more peaked than the normal distribution.
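The moment formulas above translate directly into a few lines of code. The following Python sketch uses hypothetical function names and returns infinity when the existence condition $3\alpha_1^2 < 1$ fails; this convention is an assumption made for the illustration.

```python
import math


def arch1_fourth_moment(alpha0, alpha1):
    """E Z_t^4 = 3 alpha0^2 (1 + alpha1) / ((1 - alpha1)(1 - 3 alpha1^2)), if finite."""
    if not 0 < alpha1 < 1 / math.sqrt(3):
        return math.inf  # the fourth moment does not exist
    return 3 * alpha0 ** 2 * (1 + alpha1) / ((1 - alpha1) * (1 - 3 * alpha1 ** 2))


def arch1_kurtosis(alpha0, alpha1):
    """Kurtosis E Z_t^4 / (E Z_t^2)^2 of an ARCH(1) process with normal v_t."""
    m4 = arch1_fourth_moment(alpha0, alpha1)
    if math.isinf(m4):
        return math.inf
    variance = alpha0 / (1 - alpha1)
    return m4 / variance ** 2


print(arch1_kurtosis(1.0, 0.5))  # alpha1 = 0.5 lies below 1/sqrt(3)
print(arch1_kurtosis(1.0, 0.9))  # alpha1 = 0.9 violates 3 * alpha1^2 < 1
```

For $\alpha_1 = 0.5$ the formula gives $3 \times 0.75 / 0.25 = 9$, three times the kurtosis of the normal distribution; the result does not depend on $\alpha_0$.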

Finally, we want to examine the autocorrelation function of $Z_t^2$. This will lead to a test for ARCH effects, i.e. for time-varying volatility (see Sect. 8.2 below).


³ The case $\alpha_1 = 1$ is treated in Sect. 8.1.4.

⁴ As $v_t \sim \mathrm{N}(0, 1)$ its even moments, $m_{2k} = \mathbb{E} v_t^{2k}$, $k = 1, 2, \ldots$, are given by $m_{2k} = \prod_{j=1}^{k} (2j - 1)$. Thus we get $m_4 = 3$, $m_6 = 15$, etc. As the normal distribution is symmetric, all odd moments are equal to zero.
Theorem 8.2. Assuming that $\mathbb{E} Z_t^4$ exists, $Y_t = Z_t^2/\alpha_0$ has the same autocorrelation function as the AR(1) process $W_t = \alpha_1 W_{t-1} + U_t$ with $U_t \sim \mathrm{WN}(0, 1)$. In addition, under the assumption $0 < \alpha_1 < 1$, the process $\{W_t\}$ is also causal with respect to $\{U_t\}$.

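Proof. From $Y_t = v_t^2 (1 + \alpha_1 Y_{t-1})$ we get for $h \ge 1$:

$$\begin{aligned}
\gamma_Y(h) &= \mathbb{E} Y_t Y_{t-h} - \mathbb{E} Y_t\, \mathbb{E} Y_{t-h} = \mathbb{E} Y_t Y_{t-h} - \frac{1}{(1-\alpha_1)^2} \\
&= \mathbb{E}\, v_t^2 (1 + \alpha_1 Y_{t-1}) Y_{t-h} - \frac{1}{(1-\alpha_1)^2} \\
&= \mathbb{E} Y_{t-h} + \alpha_1 \mathbb{E} Y_{t-1} Y_{t-h} - \frac{1}{(1-\alpha_1)^2} \\
&= \frac{1}{1-\alpha_1} + \alpha_1 \left(\gamma_Y(h-1) + \frac{1}{(1-\alpha_1)^2}\right) - \frac{1}{(1-\alpha_1)^2} \\
&= \alpha_1 \gamma_Y(h-1) + \frac{1 - \alpha_1 + \alpha_1 - 1}{(1-\alpha_1)^2} = \alpha_1 \gamma_Y(h-1).
\end{aligned}$$

Therefore, $\gamma_Y(h) = \alpha_1^h \gamma_Y(0) \Rightarrow \rho(h) = \alpha_1^h$.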

The unconditional variance of $X_t$ is:

$$\mathbb{V} X_t = \mathbb{V}\left(\sum_{j=0}^{\infty} \phi^j Z_{t-j}\right) = \frac{1}{1-\phi^2}\, \mathbb{V} Z_t = \frac{\alpha_0}{1-\alpha_1} \times \frac{1}{1-\phi^2}.$$

The unconditional variance of $X_t$ involves all parameters of the model. Thus modeling the variance of $X_t$ induces a trade-off between $\phi$, $\alpha_0$ and $\alpha_1$.

Figure 8.1 plots the realizations of two AR(1)-ARCH(1) processes. Both processes have been generated with the same realization of $\{v_t\}$ and the same parameters $\phi = 0.9$ and $\alpha_0 = 1$. Whereas the first process (shown in the left panel of the figure) was generated with a value of $\alpha_1 = 0.9$, the second one had a value of $\alpha_1 = 0.5$. In both cases the stability condition, $\alpha_1 < 1$, is fulfilled, but for the first process $3\alpha_1^2 > 1$, so that the fourth moment does not exist. One can clearly discern the large fluctuations, in particular for the first process.
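As a worked check of the variance formula, one can plug in the parameter values reported for the two simulated processes ($\phi = 0.9$, $\alpha_0 = 1$). The numbers below are simple arithmetic from the formula and are rounded:

$$\alpha_1 = 0.9: \quad \mathbb{V} X_t = \frac{1}{1-0.9} \times \frac{1}{1-0.81} = 10 \times 5.263 \approx 52.6, \qquad 3\alpha_1^2 = 2.43 > 1,$$

$$\alpha_1 = 0.5: \quad \mathbb{V} X_t = \frac{1}{1-0.5} \times \frac{1}{1-0.81} = 2 \times 5.263 \approx 10.5, \qquad 3\alpha_1^2 = 0.75 < 1.$$

The first process thus has an unconditional variance about five times that of the second, and only the second has a finite fourth moment of $Z_t$.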


8.1.3  General  Models  of  Volatility 

The simple ARCH(1) model can be and has been generalized in several directions. A straightforward generalization proposed by Engle (1982) consists in allowing further lags to enter the ARCH equation (8.1). This leads to the ARCH(p) model:
Fig. 8.1 Simulation of two ARCH(1) processes ($\alpha_1 = 0.9$ and $\alpha_1 = 0.5$). Panels: $Z_t$ with $\alpha_1 = 0.9$; $Z_t$ with $\alpha_1 = 0.5$; $X_t = 0.9 X_{t-1} + Z_t$ for each of the two cases.


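$$\text{ARCH}(p): \quad Z_t = v_t \sigma_t \quad \text{with} \quad \sigma_t^2 = \alpha_0 + \sum_{j=1}^{p} \alpha_j Z_{t-j}^2 \tag{8.4}$$

where $\alpha_0 > 0$, $\alpha_j \ge 0$ and $v_t \sim \mathrm{IIDN}(0, 1)$ with $v_t$ independent of $Z_{t-j}$, $j \ge 1$. A further popular generalization was proposed by Bollerslev (1986):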


$$\text{GARCH}(p, q): \quad Z_t = v_t \sigma_t \quad \text{with} \quad \sigma_t^2 = \alpha_0 + \sum_{j=1}^{p} \alpha_j Z_{t-j}^2 + \sum_{j=1}^{q} \beta_j \sigma_{t-j}^2 \tag{8.5}$$

where we assume $\alpha_0 > 0$, $\alpha_j \ge 0$, $\beta_j \ge 0$ and $v_t \sim \mathrm{IIDN}(0, 1)$ with $v_t$ independent of $Z_{t-j}$, $j \ge 1$, as before. This model is analogous to the ordinary ARMA model and allows for a parsimonious specification of the volatility process. All coefficients should be nonnegative, with $\alpha_0 > 0$, to guarantee that the variance is always positive. In addition it can be shown (see for example Fan and Yao (2003, 150) and the literature cited therein) that $\{Z_t\}$ is (strictly) stationary with finite variance if and only if $\sum_{j=1}^{p} \alpha_j + \sum_{j=1}^{q} \beta_j < 1$.⁵ Under this condition $\{Z_t\} \sim \mathrm{WN}(0, \sigma_Z^2)$ with

$$\sigma_Z^2 = \mathbb{V}(Z_t) = \frac{\alpha_0}{1 - \sum_{j=1}^{p} \alpha_j - \sum_{j=1}^{q} \beta_j}.$$
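The stationarity condition and the variance formula can be combined in a small helper. This is a sketch with a hypothetical function name; returning infinity when the sum of the coefficients reaches one is a convention assumed here, not part of the text.

```python
import math


def garch_unconditional_variance(alpha0, alphas, betas):
    """Return alpha0 / (1 - sum(alphas) - sum(betas)) for a GARCH(p, q) model.

    alphas holds alpha_1..alpha_p and betas holds beta_1..beta_q.
    Returns infinity if the finite-variance condition is violated.
    """
    if alpha0 <= 0 or any(c < 0 for c in list(alphas) + list(betas)):
        raise ValueError("need alpha0 > 0 and nonnegative alpha_j, beta_j")
    persistence = sum(alphas) + sum(betas)
    if persistence >= 1:
        return math.inf
    return alpha0 / (1 - persistence)


# Illustrative (assumed) parameter values.
print(garch_unconditional_variance(0.1, [0.1], [0.8]))      # GARCH(1,1)
print(garch_unconditional_variance(1.0, [0.3, 0.2], []))    # ARCH(2)
print(garch_unconditional_variance(0.1, [0.2], [0.8]))      # sum equals one
```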


⁵ A detailed exposition of the GARCH(1,1) model is given in Sect. 8.1.4.

As $v_t$ is still normally distributed, the odd moments of the distribution of $Z_t$ are zero and the distribution is thus symmetric. The fourth moment of $Z_t$, $\mathbb{E} Z_t^4$, exists if

$$\sqrt{3}\, \frac{\sum_{j=1}^{p} \alpha_j}{1 - \sum_{j=1}^{q} \beta_j} < 1.$$

This condition is sufficient, but not necessary.⁶ Furthermore, $\{Z_t\}$ is a white noise process with heavy-tail property if $\{Z_t\}$ is strictly stationary with finite fourth moment.

In addition, $\{Z_t^2\}$ is a causal and invertible ARMA($\max\{p, q\}, q$) process satisfying the following difference equation:

$$\begin{aligned}
Z_t^2 &= \alpha_0 + \sum_{j=1}^{p} \alpha_j Z_{t-j}^2 + \sum_{j=1}^{q} \beta_j \sigma_{t-j}^2 + e_t \\
&= \alpha_0 + \sum_{j=1}^{\max\{p,q\}} (\alpha_j + \beta_j) Z_{t-j}^2 + e_t - \sum_{j=1}^{q} \beta_j e_{t-j},
\end{aligned}$$

where $\alpha_{p+j} = \beta_{q+j} = 0$ for $j \ge 1$ and "error term"

$$e_t = Z_t^2 - \sigma_t^2 = (v_t^2 - 1) \left(\alpha_0 + \sum_{j=1}^{p} \alpha_j Z_{t-j}^2 + \sum_{j=1}^{q} \beta_j \sigma_{t-j}^2\right).$$

Note, however, that there is a circularity here because the noise process $\{e_t\}$ is defined in terms of $Z_t^2$ and is therefore not an exogenous process driving $Z_t^2$. Thus, one has to be cautious in the interpretation of $\{Z_t^2\}$ as an ARMA process.

Further generalizations of the GARCH(p,q) model can be obtained by allowing deviations from the normal distribution for $v_t$. In particular, distributions such as the t-distribution, which put more weight on extreme values, have become popular. This seems warranted as prices on financial markets exhibit large and sudden fluctuations.⁷


The  Threshold  GARCH  Model 

Assuming a symmetric distribution for $v_t$ and specifying a linear relationship between $\sigma_t^2$ and $Z_{t-j}^2$, respectively $\sigma_{t-j}^2$, $j > 0$, leads to a symmetric distribution for $Z_t$. It has, however, been observed that downward movements seem to be different from upward movements.

⁶ Zadrozny (2005) derives a necessary and sufficient condition for the existence of the fourth moment.

⁷ A thorough treatment of the probabilistic properties of GARCH processes can be found in Nelson (1990), Bougerol and Picard (1992a), Giraitis et al. (2000), Klüppelberg et al. (2004, theorem 2.1), and Lindner (2009).

This asymmetric behavior is accounted for by the asymmetric GARCH(1,1) model or threshold GARCH(1,1) model (TGARCH(1,1) model). This model was proposed by Glosten et al. (1993) and Zakoian (1994):

$$\text{asymmetric GARCH}(1,1): \quad Z_t = v_t \sigma_t \quad \text{with} \quad \sigma_t^2 = \alpha_0 + \alpha_1 Z_{t-1}^2 + \beta \sigma_{t-1}^2 + \gamma\, \mathbf{1}_{\{Z_{t-1} < 0\}} Z_{t-1}^2.$$

$\mathbf{1}_{\{Z_{t-1} < 0\}}$ denotes the indicator function which takes on the value one if $Z_{t-1}$ is negative and the value zero otherwise. Assuming, as before, that all parameters $\alpha_0$, $\alpha_1$, $\beta$ and $\gamma$ are greater than zero, this specification postulates a leverage effect because negative realizations have a greater impact than positive ones. In order to obtain a stationary process the condition $\alpha_1 + \beta + \gamma/2 < 1$ must hold. This model can be generalized in an obvious way by allowing additional lags $Z_{t-j}^2$ and $\sigma_{t-j}^2$, $j > 1$, to enter the above specification.
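The asymmetric variance recursion can be sketched in a few lines. The function name, the start value `sigma2_init` and the parameter values below are assumptions made for illustration only.

```python
def tgarch_variance_path(z, alpha0, alpha1, beta, gamma, sigma2_init):
    """Iterate sigma_t^2 = alpha0 + alpha1 Z_{t-1}^2 + beta sigma_{t-1}^2
    + gamma 1{Z_{t-1} < 0} Z_{t-1}^2 along a given path of Z.

    Element 0 of the result is the (assumed) start value; element t + 1 is
    the conditional variance implied by z[t].
    """
    sigma2 = [sigma2_init]
    for z_prev in z:
        leverage = gamma * z_prev ** 2 if z_prev < 0 else 0.0
        sigma2.append(alpha0 + alpha1 * z_prev ** 2 + beta * sigma2[-1] + leverage)
    return sigma2


# Same-sized shocks of opposite sign (illustrative parameter values).
params = dict(alpha0=0.1, alpha1=0.05, beta=0.8, gamma=0.1, sigma2_init=1.0)
after_positive = tgarch_variance_path([1.0], **params)[1]
after_negative = tgarch_variance_path([-1.0], **params)[1]
print(after_positive, after_negative)
```

With these values a negative shock of size one raises next period's conditional variance by $\gamma = 0.1$ more than a positive shock of the same size, which is the leverage effect described above.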


The  Exponential  GARCH  Model 

Another interesting and popular class of volatility models was introduced by Nelson (1991). The so-called exponential GARCH models or EGARCH models are defined as follows:

$$\begin{aligned}
\log \sigma_t^2 &= \alpha_0 + \beta \log \sigma_{t-1}^2 + \gamma \left|\frac{Z_{t-1}}{\sigma_{t-1}}\right| + \delta\, \frac{Z_{t-1}}{\sigma_{t-1}} \\
&= \alpha_0 + \beta \log \sigma_{t-1}^2 + \gamma |v_{t-1}| + \delta v_{t-1}.
\end{aligned}$$

Note that, in contrast to the previous specifications, the dependent variable is the logarithm of $\sigma_t^2$ and not $\sigma_t^2$ itself. This has the advantage that the variance is always positive irrespective of the values of the coefficients. Furthermore, the leverage effect is exponential rather than quadratic because $Z_t = v_t \exp\left(\tfrac{1}{2} \log \sigma_t^2\right)$. The EGARCH model is also less recursive than the GARCH model as the volatility is specified directly in terms of the noise process $\{v_t\}$. Thus, the above EGARCH model can be treated as an AR(1) model of $\log \sigma_t^2$ with noise process $\gamma |v_{t-1}| + \delta v_{t-1}$. It is obvious that the model can be generalized to allow for additional lags both in $\sigma_t^2$ and $v_t$. This results in an ARMA process for $\{\log \sigma_t^2\}$ for which the usual conditions for the existence of a causal and invertible solution can be applied (see Sect. 2.3). A detailed analysis and further properties of this model class can be found in Bollerslev et al. (1994), Gourieroux (1997) and Fan and Yao (2003, 143-180).
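The EGARCH recursion in terms of $v_t$ can be sketched as follows. The function name, the start value and the parameter values (including a negative $\alpha_0$ and a negative $\delta$) are illustrative assumptions.

```python
import math


def egarch_log_variance_path(v, alpha0, beta, gamma, delta, log_sigma2_init):
    """Iterate log sigma_t^2 = alpha0 + beta log sigma_{t-1}^2
    + gamma |v_{t-1}| + delta v_{t-1} along a given path of v.

    Element 0 is the (assumed) start value; element t + 1 follows from v[t].
    """
    log_s2 = [log_sigma2_init]
    for v_prev in v:
        log_s2.append(alpha0 + beta * log_s2[-1] + gamma * abs(v_prev) + delta * v_prev)
    return log_s2


params = dict(alpha0=-0.1, beta=0.9, gamma=0.2, delta=-0.1, log_sigma2_init=0.0)
log_up = egarch_log_variance_path([1.0], **params)[1]
log_down = egarch_log_variance_path([-1.0], **params)[1]
sigma2_up, sigma2_down = math.exp(log_up), math.exp(log_down)
print(sigma2_up, sigma2_down)
```

Because the variance is obtained by exponentiation, it stays positive even though $\alpha_0$ and $\delta$ are negative here; with $\delta < 0$ a negative shock raises the variance more than a positive one of equal size.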


The  ARCH-in-Mean  Model 

The ARCH-in-mean model or ARCH-M model was introduced by Engle et al. (1987) to allow for a feedback of volatility into the mean equation. More specifically, assume for the sake of simplicity that the variance equation is just represented by the ARCH(1) model

$$Z_t = v_t \sigma_t \quad \text{with} \quad \sigma_t^2 = \alpha_0 + \alpha_1 Z_{t-1}^2.$$

Then, the ARCH-M model is given by

$$X_t = M_t \beta + g(\sigma_t^2) + Z_t \tag{8.6}$$

where $g$ is a function of the volatility $\sigma_t^2$ and where $M_t$ consists of a vector of regressors, including lagged values of $X_t$. If $M_t = (1, X_{t-1})$ then we get the AR(1)-ARCH-M model. The most commonly used specification for $g$ is a linear function: $g(\sigma_t^2) = \delta_0 + \delta_1 \sigma_t^2$. In the asset pricing literature, higher volatility would require a higher return to compensate the investor for the additional risk. Thus, if $X_t$ denotes the return on some asset, we expect $\delta_1$ to be positive. Note that any time variation in $\sigma_t^2$ translates into a serial correlation of $\{X_t\}$ (see Hong 1991, for details). Of course, one could easily generalize the model to allow for more sophisticated mean and variance equations.


8.1.4  The  GARCH(1,1)  Model 

The Generalized Autoregressive Conditional Heteroskedasticity model of order (1,1), GARCH(1,1) model for short, is considered a benchmark for more general specifications and often serves as a starting point for further empirical investigations. We therefore want to explore its properties in more detail. Many of its properties generalize in a straightforward way to the GARCH(p,q) process. According to Eq. (8.5) the GARCH(1,1) model is defined as:

$$\text{GARCH}(1,1): \quad Z_t = v_t \sigma_t \quad \text{with} \quad \sigma_t^2 = \alpha_0 + \alpha_1 Z_{t-1}^2 + \beta \sigma_{t-1}^2 \tag{8.7}$$

where $\alpha_0$, $\alpha_1$, and $\beta \ge 0$. We assume $\alpha_1 + \beta > 0$ to avoid the degenerate case $\alpha_1 = \beta = 0$, which implies that $\{Z_t\}$ is just a sequence of IID random variables. Moreover, $v_t \sim \mathrm{IID}(0, 1)$ with $v_t$ being independent of $Z_{t-j}$, $j \ge 1$. Note that we do not make further distributional assumptions. In particular, $v_t$ is not required to be normally distributed. For this model, we can formulate a theorem similar to the one for the ARCH(1) model (see Theorem 8.1):

Theorem 8.3. Let $\{Z_t\}$ be a GARCH(1,1) process as defined above. Under the assumption

$$\mathbb{E} \log(\alpha_1 v_t^2 + \beta) < 0,$$

the difference equation (8.7) possesses a strictly stationary solution:

$$Z_t = v_t \sqrt{\alpha_0 \sum_{j=0}^{\infty} \prod_{i=1}^{j} \left(\alpha_1 v_{t-i}^2 + \beta\right)}$$

where $\prod_{i=1}^{j} = 1$ whenever $i > j$. The solution is also unique given the sequence $\{v_t\}$. The solution is unique and (weakly) stationary with variance $\mathbb{E} Z_t^2 = \frac{\alpha_0}{1-\alpha_1-\beta} < \infty$ if $\alpha_1 + \beta < 1$.
Proof. The proof proceeds similarly to that of Theorem 8.1. For this purpose, we define $Y_t = \sigma_t^2$ and rewrite the GARCH(1,1) model as

$$Y_t = \alpha_0 + \alpha_1 v_{t-1}^2 Y_{t-1} + \beta Y_{t-1} = \alpha_0 + (\alpha_1 v_{t-1}^2 + \beta) Y_{t-1}.$$

This defines an AR(1) process with time-varying coefficients $\xi_t = \alpha_1 v_t^2 + \beta \ge 0$. Iterate this equation backwards $k$ times to obtain:

$$\begin{aligned}
Y_t &= \alpha_0 + \alpha_0 \xi_{t-1} + \ldots + \alpha_0 \xi_{t-1} \cdots \xi_{t-k} + \xi_{t-1} \cdots \xi_{t-k} \xi_{t-k-1} Y_{t-k-1} \\
&= \alpha_0 \sum_{j=0}^{k} \prod_{i=1}^{j} \xi_{t-i} + \prod_{i=1}^{k+1} \xi_{t-i}\, Y_{t-k-1}.
\end{aligned}$$

Taking the limit $k \to \infty$, we define the process $\{Y'_t\}$

$$Y'_t = \alpha_0 \sum_{j=0}^{\infty} \prod_{i=1}^{j} \xi_{t-i}. \tag{8.8}$$

The right-hand side of this expression converges almost surely as can be seen from the following argument. Given that $v_t \sim \mathrm{IID}$ and given the assumption $\mathbb{E} \log(\alpha_1 v_t^2 + \beta) < 0$, the strong law of large numbers (Theorem C.5) implies that

$$\limsup_{j \to \infty} \left(\frac{1}{j} \sum_{i=1}^{j} \log(\xi_{t-i})\right) < 0 \quad \text{a.s.},$$

or equivalently,

$$\limsup_{j \to \infty} \log \left(\prod_{i=1}^{j} \xi_{t-i}\right)^{1/j} < 0 \quad \text{a.s.}$$

Thus,

$$\limsup_{j \to \infty} \left(\prod_{i=1}^{j} \xi_{t-i}\right)^{1/j} < 1 \quad \text{a.s.}$$

The application of the root test then shows that the infinite series (8.8) converges almost surely. Thus, $\{Y'_t\}$ is well-defined. It is easy to see that $\{Y'_t\}$ is strictly stationary and satisfies the difference equation. Moreover, if $\alpha_1 + \beta < 1$, we get

$$\mathbb{E} Y'_t = \alpha_0 \sum_{j=0}^{\infty} \mathbb{E} \prod_{i=1}^{j} \xi_{t-i} = \alpha_0 \sum_{j=0}^{\infty} (\alpha_1 + \beta)^j = \frac{\alpha_0}{1-\alpha_1-\beta}.$$

Thus, $\mathbb{E} Z_t^2 = \frac{\alpha_0}{1-\alpha_1-\beta} < \infty$.

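To show uniqueness, we assume that there exists another strictly stationary process $\{Y_t\}$ which also satisfies the difference equation. This implies that

$$\begin{aligned}
|Y_t - Y'_t| &= \xi_{t-1} |Y_{t-1} - Y'_{t-1}| = \left(\prod_{i=1}^{k} \xi_{t-i}\right) |Y_{t-k} - Y'_{t-k}| \\
&\le \left(\prod_{i=1}^{k} \xi_{t-i}\right) |Y_{t-k}| + \left(\prod_{i=1}^{k} \xi_{t-i}\right) |Y'_{t-k}|.
\end{aligned}$$

The assumption $\mathbb{E} \log \xi_t = \mathbb{E} \log(\alpha_1 v_t^2 + \beta) < 0$ together with the strong law of large numbers (Theorem C.5) implies

$$\prod_{i=1}^{k} \xi_{t-i} = \left(\exp\left(\frac{1}{k} \sum_{i=1}^{k} \log \xi_{t-i}\right)\right)^{k} \longrightarrow 0 \quad \text{a.s. as } k \to \infty.$$

As both solutions are strictly stationary, so that the distributions of $|Y_{t-k}|$ and $|Y'_{t-k}|$ do not depend on $t$, this implies that both $\left(\prod_{i=1}^{k} \xi_{t-i}\right) |Y_{t-k}|$ and $\left(\prod_{i=1}^{k} \xi_{t-i}\right) |Y'_{t-k}|$ converge in probability to zero. Thus, $Y_t = Y'_t$ a.s. once the sequence $\{\xi_t\}$, respectively $\{v_t\}$, is given. Because $Z_t = v_t \sqrt{Y_t}$ this completes the proof. □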

Remark 8.2. Using Jensen's inequality, we see that

$$\mathbb{E} \log(\alpha_1 v_t^2 + \beta) \le \log \mathbb{E}(\alpha_1 v_t^2 + \beta) = \log(\alpha_1 + \beta).$$

Thus, the condition $\alpha_1 + \beta < 1$ is sufficient, but not necessary, to ensure the existence of a strictly stationary solution. Thus even when $\alpha_1 + \beta = 1$, a strictly stationary solution exists, albeit one with infinite variance. This case is known as the IGARCH model and is discussed below. In the case $\alpha_1 + \beta < 1$, the Borel-Cantelli lemma can be used as in Theorem 8.1 to establish the uniqueness of the solution. Further details can be found in the references listed in footnote 7.
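The condition $\mathbb{E} \log(\alpha_1 v_t^2 + \beta) < 0$ has no simple closed form in general, but it is easy to evaluate numerically. The sketch below assumes $v_t \sim \mathrm{N}(0,1)$ and uses a plain midpoint rule on $[-10, 10]$; the function name, the integration range and the number of steps are assumptions chosen for illustration, and the result is only an approximation.

```python
import math


def expected_log_xi(alpha1, beta, half_width=10.0, steps=20_000):
    """Midpoint-rule approximation of E log(alpha1 * v^2 + beta) for v ~ N(0, 1).

    With an even number of steps the grid never hits v = 0 exactly, so the
    integrable log singularity that arises for beta = 0 is avoided.
    """
    h = 2 * half_width / steps
    total = 0.0
    for i in range(steps):
        v = -half_width + (i + 0.5) * h
        density = math.exp(-0.5 * v * v) / math.sqrt(2 * math.pi)
        total += math.log(alpha1 * v * v + beta) * density * h
    return total


igarch_value = expected_log_xi(0.1, 0.9)   # alpha1 + beta = 1
arch_value_3 = expected_log_xi(3.0, 0.0)   # ARCH(1) with alpha1 = 3
arch_value_4 = expected_log_xi(4.0, 0.0)   # ARCH(1) with alpha1 = 4
print(igarch_value, arch_value_3, arch_value_4)
```

The first value is negative although $\alpha_1 + \beta = 1$, in line with the remark. Under normality and $\beta = 0$ the expectation equals $\log \alpha_1 - \gamma_E - \log 2$ with $\gamma_E$ the Euler-Mascheroni constant, so the sign changes near $\alpha_1 \approx 3.56$, which the last two values bracket.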

Assume that $\alpha_1 + \beta < 1$; then a unique strictly stationary process $\{Z_t\}$ with finite variance which satisfies the above difference equation exists. In particular $Z_t \sim \mathrm{WN}(0, \sigma_Z^2)$ such that

$$\mathbb{V}(Z_t) = \frac{\alpha_0}{1-\alpha_1-\beta}.$$

Fig. 8.2 Parameter region for which a strictly stationary solution to the GARCH(1,1) process exists, assuming $v_t \sim \mathrm{IIDN}(0, 1)$

The assumption $1 - \alpha_1 - \beta > 0$ guarantees that the variance exists. The third moment of $Z_t$ is zero due to the assumption of a symmetric distribution for $v_t$. The condition for the existence of the fourth moment is: $\sqrt{3}\, \frac{\alpha_1}{1-\beta} < 1$.⁸ The kurtosis is then

$$\kappa = \frac{\mathbb{E} Z_t^4}{\left[\mathbb{E} Z_t^2\right]^2} = 3 \times \frac{1 - (\alpha_1 + \beta)^2}{1 - (\alpha_1 + \beta)^2 - 2\alpha_1^2} > 3$$

if $\mathbb{E} Z_t^4$ exists.⁹ Therefore the GARCH(1,1) model also possesses the heavy-tail property because the distribution of $Z_t$ is more peaked than the normal distribution.

Figure 8.2 shows how the different assumptions and conditions divide up the parameter space. In region I, all conditions are fulfilled. The process has a strictly stationary solution with finite variance and kurtosis. In region II, the kurtosis no longer exists, but the variance does as $\alpha_1 + \beta < 1$ still holds. In region III, the process has infinite variance, but a strictly stationary solution still exists. In region IV, no such solution exists.

Viewing the equation for $\sigma_t^2$ as a stochastic difference equation, its solution is given by

$$\sigma_t^2 = \frac{\alpha_0}{1-\beta} + \alpha_1 \sum_{j=0}^{\infty} \beta^j Z_{t-1-j}^2. \tag{8.9}$$

⁸ A necessary and sufficient condition is $(\alpha_1 + \beta)^2 + 2\alpha_1^2 < 1$ (see Zadrozny (2005)).

⁹ The condition for the existence of the fourth moment implies $3\alpha_1^2 < (1-\beta)^2$ so that the denominator $1 - \beta^2 - 2\alpha_1\beta - 3\alpha_1^2 > 1 - \beta^2 - 2\alpha_1\beta - 1 - \beta^2 + 2\beta = 2\beta(1 - \alpha_1 - \beta) > 0$.

This expression is well-defined because $0 < \beta < 1$ so that the infinite sum converges. The conditional variance given the infinite past is therefore equal to

$$\mathbb{V}(Z_t \mid Z_{t-1}, Z_{t-2}, \ldots) = \mathbb{E}\left(Z_t^2 \mid Z_{t-1}, Z_{t-2}, \ldots\right) = \frac{\alpha_0}{1-\beta} + \alpha_1 \sum_{j=0}^{\infty} \beta^j Z_{t-1-j}^2.$$

Thus, the conditional variance depends on the entire history of the time series and not just on $Z_{t-1}$ as in the case of the ARCH(1) model. As all coefficients are assumed to be positive, the clustering of volatility is more persistent than for the ARCH(1) model.

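Defining a new time series $\{e_t\}$ by $e_t = Z_t^2 - \sigma_t^2 = (v_t^2 - 1)(\alpha_0 + \alpha_1 Z_{t-1}^2 + \beta \sigma_{t-1}^2)$, one can verify that $Z_t^2$ obeys the stochastic difference equation

$$\begin{aligned}
Z_t^2 &= \alpha_0 + \alpha_1 Z_{t-1}^2 + \beta \sigma_{t-1}^2 + e_t = \alpha_0 + \alpha_1 Z_{t-1}^2 + \beta (Z_{t-1}^2 - e_{t-1}) + e_t \\
&= \alpha_0 + (\alpha_1 + \beta) Z_{t-1}^2 + e_t - \beta e_{t-1}.
\end{aligned} \tag{8.10}$$

This difference equation defines an ARMA(1,1) process if $e_t$ has finite variance, which is the case if the fourth moment of $Z_t$ exists. In this case, it is easy to verify that $\{e_t\}$ is white noise. The so-defined ARMA(1,1) process is causal and invertible with respect to $\{e_t\}$ because $0 < \alpha_1 + \beta < 1$ and $0 < \beta < 1$. The autocorrelation function (ACF), $\rho_{Z^2}(h)$, can be computed using the methods laid out in Sect. 2.4. This gives

$$\begin{aligned}
\rho_{Z^2}(1) &= \frac{(1 - \beta^2 - \alpha_1\beta)\,\alpha_1}{1 - \beta^2 - 2\alpha_1\beta} = \frac{(1 - \beta\varphi)(\varphi - \beta)}{1 + \beta^2 - 2\varphi\beta}, \\
\rho_{Z^2}(h) &= (\alpha_1 + \beta)\, \rho_{Z^2}(h-1) = \varphi\, \rho_{Z^2}(h-1), \quad h = 2, 3, \ldots
\end{aligned} \tag{8.11}$$

with $\varphi = \alpha_1 + \beta$ (see also Bollerslev (1988)).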
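The autocorrelation function in (8.11) is straightforward to tabulate. The following sketch uses a hypothetical function name and rejects parameter values for which the fourth moment does not exist, using the condition $(\alpha_1 + \beta)^2 + 2\alpha_1^2 < 1$ stated in footnote 8; the parameter values are illustrative assumptions.

```python
def garch11_acf_squares(alpha1, beta, max_lag):
    """Return [rho(0), rho(1), ..., rho(max_lag)] of Z_t^2 for a GARCH(1,1) model."""
    phi = alpha1 + beta
    if phi ** 2 + 2 * alpha1 ** 2 >= 1:
        raise ValueError("fourth moment does not exist; the ACF of Z_t^2 is undefined")
    rho1 = (1 - beta ** 2 - alpha1 * beta) * alpha1 / (1 - beta ** 2 - 2 * alpha1 * beta)
    acf = [1.0, rho1]
    for _ in range(2, max_lag + 1):
        acf.append(phi * acf[-1])  # rho(h) = (alpha1 + beta) * rho(h - 1)
    return acf[: max_lag + 1]


acf_garch = garch11_acf_squares(alpha1=0.1, beta=0.8, max_lag=3)
acf_arch = garch11_acf_squares(alpha1=0.3, beta=0.0, max_lag=3)
print(acf_garch)
print(acf_arch)
```

Setting $\beta = 0$ reproduces the ARCH(1) result $\rho(h) = \alpha_1^h$ of Theorem 8.2, which serves as a quick consistency check.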

The  IGARCH  Model 

Practice has shown that the sum $\alpha_1 + \beta$ is often close to one. Thus, it seems interesting to examine the limiting case where $\alpha_1 + \beta = 1$. This model was proposed by Engle and Bollerslev (1986) and was termed the integrated GARCH (IGARCH) model in analogy to the notion of integrated processes (see Chap. 7). From Eq. (8.10) we get

$$Z_t^2 = \alpha_0 + Z_{t-1}^2 + e_t - \beta e_{t-1}$$

with $e_t = Z_t^2 - \sigma_t^2 = (v_t^2 - 1)(\alpha_0 + (1 - \beta) Z_{t-1}^2 + \beta \sigma_{t-1}^2)$. As $\{e_t\}$ is white noise, the squared innovations $Z_t^2$ behave like a random walk with an MA(1) error term. Although the variance of $Z_t$ becomes infinite, the difference equation still allows for a strictly stationary solution provided that $\mathbb{E} \log(\alpha_1 v_t^2 + \beta) < 0$ (see Theorem 8.3 and the citations in footnote 7 for further details).¹⁰ It has been shown by Lumsdaine (1986) and Lee and Hansen (1994) that standard inference can still be applied although $\alpha_1 + \beta = 1$. The model may easily be generalized to higher lag orders.

Forecasting 

On many occasions it is necessary to obtain forecasts of the conditional variance $\sigma_t^2$. An example is given in Sect. 8.4 where the value at risk (VaR) of a portfolio several periods ahead must be evaluated. Denote by $\mathbb{P}_t \sigma_{t+h}^2$ the $h$ period ahead forecast based on information available in period $t$. We assume that predictions are based on the infinite past. Then the one-period ahead forecast based on $Z_t$ and $\sigma_t^2$, respectively $v_t$ and $\sigma_t^2$, is:

$$\mathbb{P}_t \sigma_{t+1}^2 = \alpha_0 + \alpha_1 Z_t^2 + \beta \sigma_t^2 = \alpha_0 + (\alpha_1 v_t^2 + \beta) \sigma_t^2. \tag{8.12}$$

As $v_t \sim \mathrm{IID}(0, 1)$ and independent of $Z_{t-j}$, $j \ge 1$,

$$\mathbb{P}_t \sigma_{t+2}^2 = \alpha_0 + (\alpha_1 + \beta)\, \mathbb{P}_t \sigma_{t+1}^2.$$

Thus, forecasts for $h \ge 2$ can be obtained recursively as follows:

$$\begin{aligned}
\mathbb{P}_t \sigma_{t+h}^2 &= \alpha_0 + (\alpha_1 + \beta)\, \mathbb{P}_t \sigma_{t+h-1}^2 \\
&= \alpha_0 \sum_{j=0}^{h-2} (\alpha_1 + \beta)^j + (\alpha_1 + \beta)^{h-1}\, \mathbb{P}_t \sigma_{t+1}^2.
\end{aligned} \tag{8.13}$$

Assuming $\alpha_1 + \beta < 1$, the second term in the above expression vanishes as $h$ goes to infinity. Thus, the contribution of the current conditional variance vanishes when we look further and further into the future. The forecast of the conditional variance then approaches the unconditional one: $\lim_{h \to \infty} \mathbb{P}_t \sigma_{t+h}^2 = \frac{\alpha_0}{1-\alpha_1-\beta}$. If $\alpha_1 + \beta = 1$ as in the IGARCH model, the contribution of the current conditional variance is constant, but diminishes to zero relative to the first term. Finally, if $\alpha_1 + \beta > 1$, the two terms are of the same order and we have a particularly persistent situation.

In practice, the parameters of the model are unknown and must therefore be replaced by estimates. The method can be easily adapted to higher order models. Instead of using the recursive approach outlined above, it is possible to use simulation methods by drawing repeatedly from the actual empirical distribution of the $\hat{v}_t = \hat{Z}_t / \hat{\sigma}_t$. This has the advantage of capturing deviations from the underlying distributional assumptions (see Sect. 8.4 for a comparison of both methods). Such methods must be applied if nonlinear models for the conditional variance, like the TARCH model, are employed.
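The recursion (8.12) and (8.13) can be sketched as follows. The function names and the numerical values for the parameters, $Z_t$ and $\sigma_t^2$ are illustrative assumptions; in an application the parameters would be estimates.

```python
def garch11_variance_forecasts(alpha0, alpha1, beta, z_t, sigma2_t, horizon):
    """Return [P_t sigma^2_{t+1}, ..., P_t sigma^2_{t+horizon}] by recursion."""
    forecasts = [alpha0 + alpha1 * z_t ** 2 + beta * sigma2_t]  # Eq. (8.12)
    for _ in range(2, horizon + 1):
        forecasts.append(alpha0 + (alpha1 + beta) * forecasts[-1])
    return forecasts


def garch11_forecast_closed_form(alpha0, alpha1, beta, one_step, h):
    """Closed form (8.13) for the h-period ahead forecast, h >= 1."""
    phi = alpha1 + beta
    return alpha0 * sum(phi ** j for j in range(h - 1)) + phi ** (h - 1) * one_step


# Illustrative values: a large shock Z_t = 2 when sigma_t^2 = 1.5.
path = garch11_variance_forecasts(0.1, 0.1, 0.8, z_t=2.0, sigma2_t=1.5, horizon=200)
unconditional = 0.1 / (1 - 0.1 - 0.8)
print(path[0], path[4], path[-1], unconditional)
```

The one-step forecast starts well above the unconditional variance and decays geometrically toward it at rate $\alpha_1 + \beta$.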


¹⁰ As the variance becomes infinite, the IGARCH process is an example of a stochastic process which is strictly stationary, but not (weakly) stationary.

8.2 Tests for Heteroskedasticity

Before modeling the volatility of a time series it is advisable to test whether heteroskedasticity is actually present in the data. For this purpose the literature has proposed several tests, of which we are going to examine two. For both tests the null hypothesis is that there is no heteroskedasticity, i.e. that there are no ARCH effects. These tests can also be useful in a conventional regression setting.


8.2.1  Autocorrelation  of  Quadratic  Residuals 


The first test is based on the autocorrelation function of squared residuals from a preliminary regression. This preliminary regression or mean regression produces a series $\hat{Z}_t$ which should be approximately white noise if the equation is well specified. Then we can look at the ACF of the squared residuals $\{\hat{Z}_t^2\}$ and apply the Ljung-Box test (see Eq. (4.4)). Thus the test can be broken down into three steps.

(i) Estimate an ARMA model for $\{X_t\}$ and retrieve the residuals $\hat{Z}_t$ from this model. Compute $\hat{Z}_t^2$. These data can be used to estimate $\sigma^2$ as

$$\hat{\sigma}^2 = \frac{1}{T} \sum_{t=1}^{T} \hat{Z}_t^2.$$

Note that the ARMA model should be specified such that the residuals are approximately white noise.

(ii) Estimate the ACF for the squared residuals in the usual way:

$$\hat{\rho}_{Z^2}(h) = \frac{\sum_{t=h+1}^{T} \left(\hat{Z}_t^2 - \hat{\sigma}^2\right)\left(\hat{Z}_{t-h}^2 - \hat{\sigma}^2\right)}{\sum_{t=1}^{T} \left(\hat{Z}_t^2 - \hat{\sigma}^2\right)^2}.$$

(iii) It is now possible to use one of the methods laid out in Chap. 4 to test the null hypothesis that $\{Z_t^2\}$ is white noise. It can be shown that under the null hypothesis $\sqrt{T} \hat{\rho}_{Z^2}(h) \xrightarrow{d} \mathrm{N}(0, 1)$. One can therefore construct confidence intervals for the ACF in the usual way. Alternatively, one may use the Ljung-Box test statistic (see Eq. (4.4)) to test the hypothesis that all correlation coefficients up to order $N$ are simultaneously equal to zero.

$$Q' = T(T+2) \sum_{h=1}^{N} \frac{\hat{\rho}_{Z^2}^2(h)}{T-h}.$$

Under the null hypothesis this statistic is asymptotically distributed as $\chi_N^2$. To carry out the test, $N$ should be chosen rather high, for example equal to $T/4$.
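Steps (ii) and (iii) can be sketched in plain Python. The function names are hypothetical, the residuals are assumed to come from a previously estimated mean equation, and the short input series is an artificial example used only to show the mechanics.

```python
def acf_of_squares(residuals, max_lag):
    """Sample autocorrelations rho(1), ..., rho(max_lag) of the squared residuals."""
    sq = [z * z for z in residuals]
    n = len(sq)
    sigma2_hat = sum(sq) / n  # estimate of sigma^2 from step (i)
    denom = sum((x - sigma2_hat) ** 2 for x in sq)
    return [
        sum((sq[t] - sigma2_hat) * (sq[t - h] - sigma2_hat) for t in range(h, n)) / denom
        for h in range(1, max_lag + 1)
    ]


def ljung_box_squares(residuals, max_lag):
    """Ljung-Box statistic Q' = T (T + 2) sum_h rho(h)^2 / (T - h) for squared residuals."""
    n = len(residuals)
    rho = acf_of_squares(residuals, max_lag)
    return n * (n + 2) * sum(r * r / (n - h) for h, r in enumerate(rho, start=1))


# Artificial residual series with a clear pattern in the squares.
residuals = [1.0, -1.0, 2.0, -2.0, 1.0, -1.0, 2.0, -2.0]
rho = acf_of_squares(residuals, 2)
q_stat = ljung_box_squares(residuals, 2)
print(rho, q_stat)
```

The statistic would then be compared with a critical value of the $\chi_N^2$ distribution, for instance 5.99 at the 5 % level for $N = 2$; computing p-values requires a $\chi^2$ distribution function, which is left out of this sketch.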
8.2.2 Engle's Lagrange-Multiplier Test

Engle (1982) proposed a Lagrange-Multiplier test. This test rests on an auxiliary regression of the squared residuals against a constant and the lagged values $\hat{Z}_{t-1}^2, \hat{Z}_{t-2}^2, \ldots, \hat{Z}_{t-p}^2$, where $\{\hat{Z}_t\}$ is again obtained from a preliminary regression. The auxiliary regression thus is

$$\hat{Z}_t^2 = \alpha_0 + \alpha_1 \hat{Z}_{t-1}^2 + \alpha_2 \hat{Z}_{t-2}^2 + \ldots + \alpha_p \hat{Z}_{t-p}^2 + \varepsilon_t,$$

where $\varepsilon_t$ denotes the error term. Then the null hypothesis $H_0: \alpha_1 = \alpha_2 = \ldots = \alpha_p = 0$ is tested against the alternative hypothesis $H_1: \alpha_j \neq 0$ for at least one $j$. As a test statistic one can use the coefficient of determination times $T$, i.e. $TR^2$. Under the null hypothesis, this test statistic is asymptotically distributed as $\chi^2$ with $p$ degrees of freedom. Alternatively, one may use the conventional F-test.
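The auxiliary regression and the $TR^2$ statistic can be sketched without external libraries by solving the normal equations directly. The function names are hypothetical, and $T$ is taken here to be the number of observations actually used in the auxiliary regression, which is an assumption of this sketch. Solving normal equations by Gaussian elimination is adequate for small $p$, although a dedicated least-squares routine would be numerically preferable.

```python
def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (m[r][n] - sum(m[r][c] * x[c] for c in range(r + 1, n))) / m[r][r]
    return x


def engle_lm_statistic(residuals, p):
    """Return T * R^2 from regressing Z_t^2 on a constant and p lags of Z_t^2."""
    sq = [z * z for z in residuals]
    y = sq[p:]
    x = [[1.0] + [sq[t - j] for j in range(1, p + 1)] for t in range(p, len(sq))]
    k = p + 1
    xtx = [[sum(row[i] * row[j] for row in x) for j in range(k)] for i in range(k)]
    xty = [sum(row[i] * yi for row, yi in zip(x, y)) for i in range(k)]
    coef = solve_linear_system(xtx, xty)
    fitted = [sum(c * v for c, v in zip(coef, row)) for row in x]
    y_bar = sum(y) / len(y)
    ss_res = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
    ss_tot = sum((yi - y_bar) ** 2 for yi in y)
    return len(y) * (1.0 - ss_res / ss_tot)


# Artificial example: squares follow s_t = 1 + 0.5 s_{t-1} exactly, so R^2 = 1.
squares = [1.0]
for _ in range(9):
    squares.append(1.0 + 0.5 * squares[-1])
lm_stat = engle_lm_statistic([s ** 0.5 for s in squares], p=1)
print(lm_stat)
```

In this artificial example the fit is perfect, so the statistic equals the number of usable observations; with real residuals it would be compared with a $\chi_p^2$ critical value.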


8.3  Estimation  of  GARCH(p,q)  Models 
8.3.1  Maximum-Likelihood  Estimation 

The  literature  has  proposed  several  approaches  to  estimate  models  of  volatility  (see 
Fan  and  Yao  (2003,  156-162)).  The  most  popular  one,  however,  rest  on  the  method 
of  maximum-likelihood.  We  will  describe  this  method  using  the  GARCH(p,q) 
model.  Related  and  more  detailed  accounts  can  be  found  in  Weiss  (1986),  Bollerslev 
et  al.  (1994)  and  Hall  and  Yao  (2003). 

In  particular  we  consider  the  following  model: 

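mean equation: $X_t = c + \phi_1 X_{t-1} + \ldots + \phi_r X_{t-r} + Z_t,$

where

$Z_t = \nu_t \sigma_t$ with $\nu_t \sim \mathrm{IIDN}(0,1)$ and

variance equation: $\sigma_t^2 = \alpha_0 + \sum_{j=1}^{p} \alpha_j Z_{t-j}^2 + \sum_{j=1}^{q} \beta_j \sigma_{t-j}^2.$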

The mean equation represents a simple AR(r) process which we assume to be causal with respect to $\{Z_t\}$, i.e. all roots of $\Phi(z)$ are outside the unit circle. The method demonstrated here can be easily generalized to ARMA processes or even ARMA processes with additional exogenous variables (so-called ARMAX processes) as noted by Weiss (1986). The method also incorporates the ARCH-in-mean model (see equation (8.6)) which allows for an effect of the conditional variance $\sigma_t^2$ on $X_t$. In addition, we assume that the coefficients of the variance equation are all positive, that $\sum_{j=1}^{p} \alpha_j + \sum_{j=1}^{q} \beta_j < 1$ and that $E Z_t^4 < \infty$ exists.11

As $\nu_t$ is identically and independently standard normally distributed, the distribution of $X_t$ conditional on $\mathcal{X}_{t-1} = \{X_{t-1}, X_{t-2}, \ldots\}$ is normal with mean $c + \phi_1 X_{t-1} + \ldots + \phi_r X_{t-r}$ and variance $\sigma_t^2$. The conditional density, $f(X_t|\mathcal{X}_{t-1})$, therefore is:

$$f(X_t|\mathcal{X}_{t-1}) = \frac{1}{\sqrt{2\pi\sigma_t^2}} \exp\left(-\frac{Z_t^2}{2\sigma_t^2}\right)$$

where $Z_t$ equals $X_t - c - \phi_1 X_{t-1} - \ldots - \phi_r X_{t-r}$ and $\sigma_t^2$ is given by the variance equation.12 The joint density $f(X_1, X_2, \ldots, X_T)$ of a random sample $(X_1, X_2, \ldots, X_T)$ can therefore be factorized as

$$f(X_1, X_2, \ldots, X_T) = f(X_1, X_2, \ldots, X_{s-1}) \prod_{t=s}^{T} f(X_t|\mathcal{X}_{t-1})$$

where s is an integer greater than p. The first s - 1 observations cannot be factorized because $\sigma_t^2$ can only be evaluated for t > p in the ARCH(p) model. For the ARCH(p) model s can be set to p + 1. In the case of a GARCH model $\sigma_t^2$ is given by a weighted infinite sum of $Z_{t-1}^2, Z_{t-2}^2, \ldots$ (see the expression (8.9) for $\sigma_t^2$ in the GARCH(1,1) model). For finite samples, this infinite sum must be approximated by a finite sum of s summands such that the number of summands s increases with the sample size (see Hall and Yao (2003)).

We then merge all parameters of the model as follows: $\phi = (c, \phi_1, \ldots, \phi_r)'$, $\alpha = (\alpha_0, \alpha_1, \ldots, \alpha_p)'$ and $\beta = (\beta_1, \ldots, \beta_q)'$. For a given realization $x = (x_1, x_2, \ldots, x_T)$ the likelihood function conditional on x, $L(\phi, \alpha, \beta|x)$, is defined as

$$L(\phi, \alpha, \beta|x) = f(x_1, x_2, \ldots, x_{s-1}) \prod_{t=s}^{T} f(x_t|\mathcal{X}_{t-1})$$


where in $\mathcal{X}_{t-1}$ the random variables are replaced by their realizations. The likelihood function can be seen as the probability of observing the data at hand given the values for the parameters. The method of maximum likelihood then consists in choosing the parameters $(\phi, \alpha, \beta)$ such that the likelihood function is maximized. Thus we choose the parameters so that the probability of observing the data is maximized. In this way we obtain the maximum likelihood estimator. Taking the first s realizations as given deterministic starting values, we then get the conditional likelihood function.

11 The existence of the fourth moment is necessary for the asymptotic normality of the maximum-likelihood estimator, but not for its consistency. It is possible to relax this assumption somewhat (see Hall and Yao (2003)).

12 If $\nu_t$ is assumed to follow a distribution other than the normal, one may use this distribution instead.

In practice we do not maximize the likelihood function but its logarithm, where we take $f(x_1, \ldots, x_{s-1})$ as a fixed constant which can be neglected in the optimization:

$$\log L(\phi, \alpha, \beta|x) = \sum_{t=s}^{T} \log f(x_t|\mathcal{X}_{t-1}) = -\frac{1}{2} \sum_{t=s}^{T} \left[ \log(2\pi) + \log \sigma_t^2 + \frac{z_t^2}{\sigma_t^2} \right]$$

where $z_t = x_t - c - \phi_1 x_{t-1} - \ldots - \phi_r x_{t-r}$ denotes the realization of $Z_t$. The maximum likelihood estimator is obtained by maximizing the likelihood function over the admissible parameter space. Usually, implementing the stationarity condition and the condition for the existence of the fourth moment turns out to be difficult and cumbersome, so that these conditions are often neglected and only checked in retrospect, or some ad hoc solutions are envisaged. It can be shown that the (conditional) maximum likelihood estimator leads to asymptotically normally distributed estimates.13 The maximum likelihood estimator remains meaningful even when $\{\nu_t\}$ is not normally distributed. In this case the quasi maximum likelihood estimator is obtained (see Hall and Yao (2003) and Fan and Yao (2003)).
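To make the conditional log-likelihood concrete, the following sketch evaluates it for the special case r = p = q = 1 under the normality assumption. It is an illustration only: the function names, the choice of the sample variance of the residuals as start value for the variance recursion, and the simulated data are assumptions of the example, not part of the text.

```python
import numpy as np

def garch11_neg_loglik(params, x):
    """Negative conditional Gaussian log-likelihood of an AR(1)-GARCH(1,1) model.

    params = (c, phi1, alpha0, alpha1, beta1); x is the observed series.
    """
    c, phi1, alpha0, alpha1, beta1 = params
    x = np.asarray(x, dtype=float)
    z = x[1:] - c - phi1 * x[:-1]       # realizations z_t of Z_t
    sigma2 = np.empty_like(z)
    sigma2[0] = z.var()                 # assumed start value of the recursion
    for t in range(1, z.size):
        sigma2[t] = alpha0 + alpha1 * z[t - 1] ** 2 + beta1 * sigma2[t - 1]
    if np.any(sigma2 <= 0):
        return np.inf                   # outside the admissible parameter space
    # The first residual only initializes the recursion and is left out of the sum
    ll = -0.5 * np.sum(np.log(2 * np.pi) + np.log(sigma2[1:]) + z[1:] ** 2 / sigma2[1:])
    return -ll

def simulate_ar1_garch11(T, c, phi1, alpha0, alpha1, beta1, seed=0):
    """Simulate the model with standard normal nu_t (hypothetical demo helper)."""
    rng = np.random.default_rng(seed)
    x = np.zeros(T)
    z_prev = 0.0
    sigma2_prev = alpha0 / (1.0 - alpha1 - beta1)  # unconditional variance
    for t in range(1, T):
        sigma2 = alpha0 + alpha1 * z_prev ** 2 + beta1 * sigma2_prev
        z = np.sqrt(sigma2) * rng.standard_normal()
        x[t] = c + phi1 * x[t - 1] + z
        z_prev, sigma2_prev = z, sigma2
    return x

true_params = (0.1, 0.2, 0.1, 0.1, 0.8)
x = simulate_ar1_garch11(3000, *true_params)
print(garch11_neg_loglik(true_params, x))
print(garch11_neg_loglik((0.1, 0.2, 5.0, 0.0, 0.0), x))  # badly misspecified variance
```

In an actual estimation this function would be handed to a numerical optimizer (for instance `scipy.optimize.minimize` with bounds on the variance parameters), which is the numerical maximization the text refers to.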

For numerical reasons it is often convenient to treat the mean equation and the variance equation separately. As the mean equation is a simple AR(r) model, it can be estimated by ordinary least-squares (OLS) in a first step. This leads to consistent parameter estimates. However, due to the heteroskedasticity, this is no longer true for the covariance matrix of the coefficients so that the usual t- and F-tests are not reliable. This problem can be circumvented by the use of the White correction (see White (1980)). In this way it is possible to find an appropriate specification for the mean equation without having to estimate the complete model. In the second step, one can then work with the residuals to find an appropriate ARMA model for the squared residuals. This leads to consistent estimates of the parameters of the variance equation. Under additional weak assumptions these estimates are asymptotically normally distributed (see Weiss (1986)). It should, however, be noted that this way of proceeding is, in contrast to the maximum likelihood estimator, not efficient because it neglects the nonlinear character of the GARCH model. The parameters found in this way can, however, serve as meaningful starting values for the numerical maximization procedure which underlies the maximum likelihood estimation.


13 Jensen and Rahbek (2004) showed that, at least for the GARCH(1,1) case, the stationarity condition is not necessary.

A final remark concerns the choice of the parameters r, p and q. Similarly to the ordinary ARMA models, one can use information criteria such as the Akaike or the Bayes criterion to determine the order of the model (see Sect. 5.4).


8.3.2  Method  of  Moment  Estimation 


The maximization of the likelihood function requires the use of numerical optimization routines. Depending on the routine actually used and on the starting value, different results may be obtained if the likelihood function is not well-behaved. It is therefore of interest to have alternative estimation methods at hand. The method of moments is such an alternative. It is similar to the Yule-Walker estimator (see Sect. 5.1) applied to the autocorrelation function of $\{Z_t^2\}$. This method not only leads to an analytic solution, but can also be easily implemented. Following Kristensen and Linton (2006), we will illustrate the method for the GARCH(1,1) model.

Equation (8.11) applied to $\rho_{Z^2}(1)$ and $\rho_{Z^2}(2)$ constitutes a nonlinear equation system in the unknown parameters $\beta$ and $\alpha_1$. This system can be reparameterized to yield an equation system in $\varphi = \alpha_1 + \beta$ and $\beta$ which can be reduced to a single quadratic equation in $\beta$:

$$\beta^2 - b\beta + 1 = 0 \quad \text{where} \quad b = \frac{\varphi^2 + 1 - 2\rho_{Z^2}(1)\varphi}{\varphi - \rho_{Z^2}(1)}.$$

The parameter b is well-defined because $\varphi = \alpha_1 + \beta \geq \rho_{Z^2}(1)$ with equality only if $\beta = 0$. In the following we will assume that $\beta > 0$. Under this assumption b > 2 so that the only solution with the property $0 < \beta < 1$ is given by

$$\beta = \frac{b - \sqrt{b^2 - 4}}{2}.$$

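The moment estimator can therefore be constructed as follows:

(i) Estimate the correlations $\rho_{Z^2}(1)$ and $\rho_{Z^2}(2)$ and $\sigma^2$ based on the formulas (8.11) in Sect. 8.2.

(ii) An estimate for $\varphi = \alpha_1 + \beta$ is then given by

$$\hat{\varphi} = \widehat{(\alpha_1 + \beta)} = \frac{\hat{\rho}_{Z^2}(2)}{\hat{\rho}_{Z^2}(1)}.$$

(iii) Use the estimate $\hat{\varphi}$ to compute an estimate for b:

$$\hat{b} = \frac{\hat{\varphi}^2 + 1 - 2\hat{\rho}_{Z^2}(1)\hat{\varphi}}{\hat{\varphi} - \hat{\rho}_{Z^2}(1)}.$$

The estimate $\hat{\beta}$ for $\beta$ is then

$$\hat{\beta} = \frac{\hat{b} - \sqrt{\hat{b}^2 - 4}}{2}.$$

(iv) The estimate for $\alpha_1$ is $\hat{\alpha}_1 = \hat{\varphi} - \hat{\beta}$. Because $\alpha_0 = \sigma^2(1 - (\alpha_1 + \beta))$, the estimate for $\alpha_0$ is equal to $\hat{\alpha}_0 = \hat{\sigma}^2(1 - \hat{\varphi})$.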

Kristensen and Linton (2006) show that, given the existence of the fourth moment of $Z_t$, this method of moments leads to consistent and asymptotically normally distributed estimates. These estimates may then serve as starting values for the maximization of the likelihood function to improve efficiency.
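Steps (ii) to (iv) translate directly into a few lines of code. The sketch below is only an illustration; the function name and the dictionary it returns are assumptions of the example. As inputs it uses the sample values reported for the SMI in Sect. 8.4 (correlations 0.228 and 0.181 of the squared residuals and a variance estimate of 1.391).

```python
import math

def garch11_moment_estimates(rho1, rho2, sigma2):
    """Method-of-moments estimates for a GARCH(1,1) model.

    rho1, rho2: estimated autocorrelations of the squared residuals at lags 1 and 2
    sigma2:     estimated unconditional variance of the residuals
    """
    phi = rho2 / rho1                                    # step (ii): alpha1 + beta
    b = (phi ** 2 + 1.0 - 2.0 * rho1 * phi) / (phi - rho1)  # step (iii)
    beta = (b - math.sqrt(b ** 2 - 4.0)) / 2.0           # root with 0 < beta < 1
    alpha1 = phi - beta                                  # step (iv)
    alpha0 = sigma2 * (1.0 - phi)
    return {"phi": phi, "b": b, "beta": beta, "alpha1": alpha1, "alpha0": alpha0}

est = garch11_moment_estimates(0.228, 0.181, 1.391)
for name, value in est.items():
    print(f"{name} = {value:.3f}")
```

Up to rounding, these inputs reproduce the moment estimates quoted in the SMI example (b of about 2.241, beta of about 0.615, alpha1 of about 0.179 and alpha0 of about 0.287).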


8.4  Example:  Swiss  Market  Index  (SMI) 

In this section, we will apply the methods discussed previously to analyze the volatility of the Swiss Market Index (SMI). The SMI is the most important stock market index for Swiss blue chip companies. It is constructed solely from stock market prices; dividends are not accounted for. The data are the daily values of the index between the 3rd of January 1989 and the 13th of February 2004. Figure 1.5 shows a plot of the data. Instead of analyzing the level of the SMI, we will investigate the daily return computed as the logged difference. This time series is denoted by $X_t$ and plotted in Fig. 8.3. One can clearly discern phases of high (observations around t = 2500 and t = 3500) and low (t = 1000 and t = 2000) volatility. This represents a first sign of heteroskedasticity and positively correlated volatility.


Fig. 8.3 Daily return of the SMI (Swiss Market Index) computed as $\Delta \log(\mathrm{SMI}_t)$ between January 3rd 1989 and February 13th 2004

Fig. 8.4 Normal-Quantile Plot of the daily returns of the SMI (Swiss Market Index)


Figure 8.4 shows a normal-quantile plot to compare the empirical distribution of the returns with that of a normal distribution. This plot clearly demonstrates that the probability of observing large returns is bigger than warranted by a normal distribution. Thus the distribution of returns exhibits the heavy-tail property. A similar argument can be made by comparing the histogram of the returns and the density of a normal distribution with the same mean and the same variance, shown in Fig. 8.5. Again one can see that returns which are large in absolute value are more probable than expected from a normal distribution. Moreover, the histogram shows no obvious sign of an asymmetric distribution, but a higher peakedness.

After the examination of some preliminary graphical devices, we are going to analyze the autocorrelation functions of $\{X_t\}$ and $\{X_t^2\}$. Figure 8.6 shows the estimated ACFs. The estimated ACF of $\{X_t\}$ shows practically no significant autocorrelation so that we can consider $\{X_t\}$ to be approximately white noise. The corresponding Ljung-Box statistic with L = 100, however, has a value of 129.62 which is just above the 5 % critical value of 124.34. Thus there is some sign of weak autocorrelation. This feature is not in line with efficiency of the Swiss stock market (see Campbell et al. (1997)). The estimated ACF of $X_t^2$ is clearly outside the 95 % confidence interval for at least up to order 20. Thus we can reject the hypothesis of homoskedasticity in favor of heteroskedasticity. This is confirmed by the Ljung-Box statistic with L = 100 with a value of 2000.93 which is much higher than the critical value of 124.34.

After these first investigations, we want to find an appropriate model for the mean equation. We will use OLS with the White correction. It turns out that an MA(1) model fits the data best although an AR(1) model leads to almost the same results. In the next step we will estimate a GARCH(p,q) model with the method of maximum likelihood where p is varied between 1 and 3 and q between 0 and 3. The values of the AIC and the BIC criterion corresponding to the variance equation are listed in Tables 8.1 and 8.2.

Fig. 8.5 Histogram of the daily returns of the SMI (Swiss Market Index) and the density of a fitted normal distribution (red line)

Fig. 8.6 ACF of the daily returns and the squared daily returns of the SMI

Table 8.1 AIC criterion for the variance equation in the GARCH(p,q) model

         q = 0     q = 1     q = 2     q = 3
p = 1    3.0886    2.9491    2.9491    2.9482
p = 2    3.0349    2.9496    2.9491    2.9486
p = 3    2.9842    2.9477    2.9472    2.9460

Minimum value: 2.9460 (p = 3, q = 3)

Table 8.2 BIC criterion for the variance equation in the GARCH(p,q) model

         q = 0     q = 1     q = 2     q = 3
p = 1    3.0952    2.9573    2.9590    2.9597
p = 2    3.0431    2.9595    2.9606    2.9617
p = 3    2.9941    2.9592    2.9604    2.9607

Minimum value: 2.9573 (p = 1, q = 1)

The results reported in these tables show that the AIC criterion favors a GARCH(3,3) model, corresponding to the minimum value in Table 8.1, whereas the BIC criterion opts for a GARCH(1,1) model, corresponding to the minimum value in Table 8.2. It also turns out that for high dimensional models, in particular those for which q > 0, the maximization algorithm has problems finding an optimum. Furthermore, the roots of the implicit AR and the MA polynomial corresponding to the variance equation of the GARCH(3,3) model are very similar. These two arguments lead us to prefer the GARCH(1,1) over the GARCH(3,3) model. The GARCH(1,1) model was estimated to have the following mean equation:


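$$X_t = \underset{(0.0174)}{0.0755} + Z_t + \underset{(0.0184)}{0.0484}\, Z_{t-1}$$

with the corresponding variance equation

$$\sigma_t^2 = \underset{(0.0046)}{0.0765} + \underset{(0.0095)}{0.1388}\, Z_{t-1}^2 + \underset{(0.0099)}{0.8081}\, \sigma_{t-1}^2,$$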

where the estimated standard deviations are reported below the corresponding coefficient estimate. The small, but significant value of 0.0484 for the MA(1) coefficient shows that there is a small but systematic correlation of the returns from one day to the next. The coefficients of the GARCH model are all positive and their sum $\hat{\alpha}_1 + \hat{\beta} = 0.1388 + 0.8081 = 0.9469$ is statistically below one so that all conditions for a stationary process are fulfilled.14 Because $\sqrt{3}\,\hat{\alpha}_1/(1 - \hat{\beta}) = 1.2528 > 1$, the condition for the existence of the fourth moment of $Z_t$ is violated.
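The two numbers quoted in this paragraph can be verified with a short calculation. The sketch below assumes that the fourth-moment condition used in the text has the form $\sqrt{3}\,\alpha_1/(1-\beta) < 1$; this reading is an assumption, adopted because it reproduces the reported value of 1.2528 from the estimated coefficients. The variable names are illustrative.

```python
import math

alpha1_hat = 0.1388  # estimated ARCH coefficient
beta_hat = 0.8081    # estimated GARCH coefficient

# Stationarity requires alpha1 + beta < 1
persistence = alpha1_hat + beta_hat

# Assumed form of the fourth-moment condition: sqrt(3) * alpha1 / (1 - beta) < 1
fourth_moment_stat = math.sqrt(3.0) * alpha1_hat / (1.0 - beta_hat)

print(f"alpha1 + beta = {persistence:.4f}")
print(f"sqrt(3) * alpha1 / (1 - beta) = {fourth_moment_stat:.4f}")
print("stationarity condition satisfied:", persistence < 1.0)
print("fourth-moment condition satisfied:", fourth_moment_stat < 1.0)
```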


14 The corresponding Wald test clearly rejects the null hypothesis $\alpha_1 + \beta = 1$ at a significance level of 1 %.

As a comparison we also estimate the GARCH(1,1) model using the method of moments. First we estimate an MA(1) model for $\Delta \log \mathrm{SMI}$. This results in an estimate $\hat{\theta} = 0.034$ (compare this with the ML estimate). The squared residuals have correlation coefficients

$$\hat{\rho}_{Z^2}(1) = 0.228 \quad \text{and} \quad \hat{\rho}_{Z^2}(2) = 0.181.$$

The estimate of b therefore is $\hat{b} = 2.241$ which leads to an estimate of $\beta$ equal to $\hat{\beta} = 0.615$. This finally results in the estimates of $\alpha_1$ and $\alpha_0$ equal to $\hat{\alpha}_1 = 0.179$ and $\hat{\alpha}_0 = 0.287$ with an estimate for $\sigma^2$ equal to $\hat{\sigma}^2 = 1.391$. Thus these estimates are quite different from those obtained by the ML method.


Value  at  Risk 

We are now in a position to use our ML estimates to compute the Value-at-Risk (VaR). The VaR is a very popular measure to estimate the risk of an investment. In our case we consider the market portfolio represented by the stocks in the SMI. The VaR is defined as the maximal loss (in absolute value) of an investment which occurs with probability $\alpha$ over a time horizon h. Thus a 1 % VaR for the return on the SMI for the next day is the threshold value of the return such that one can be 99 % confident that the loss will not exceed this value. Thus the $\alpha$ VaR at time t for h periods, $\mathrm{VaR}^{\alpha}_{t,t+h}$, is nothing but the $\alpha$-quantile of the distribution of the forecast of the return in h periods given information $X_{t-k}$, k = 0, 1, 2, .... Formally, we have:

$$\mathrm{VaR}^{\alpha}_{t,t+h} = \inf\left\{x : P\left[\tilde{X}_{t+h} \leq x \mid X_t, X_{t-1}, \ldots\right] \geq \alpha\right\},$$

where $\tilde{X}_{t+h}$ is the return of the portfolio over an investment horizon of h periods. This return is approximately equal to the sum of the daily returns: $\tilde{X}_{t+h} = \sum_{j=1}^{h} X_{t+j}$.

The one period forecast error is given by $X_{t+1} - \widetilde{\mathbb{P}}_t X_{t+1}$ which is equal to $Z_{t+1} = \sigma_{t+1}\nu_{t+1}$. Thus the VaR for the next day is

$$\mathrm{VaR}^{\alpha}_{t,t+1} = \inf\left\{x : P\left[\nu_{t+1} \leq \frac{x - \widetilde{\mathbb{P}}_t X_{t+1}}{\sigma_{t+1}}\right] \geq \alpha\right\}.$$

This entity can be computed by replacing the forecast given the infinite past, $\widetilde{\mathbb{P}}_t X_{t+1}$, by a forecast given the finite sample information $X_{t-k}$, k = 0, 1, 2, ..., t - 1, $\mathbb{P}_t X_{t+1}$, and by substituting $\sigma_{t+1}$ by the corresponding forecast from the variance equation, $\hat{\sigma}_{t,t+1}$. Thus we get:

$$\widehat{\mathrm{VaR}}^{\alpha}_{t,t+1} = \inf\left\{x : P\left[\nu_{t+1} \leq \frac{x - \mathbb{P}_t X_{t+1}}{\hat{\sigma}_{t,t+1}}\right] \geq \alpha\right\}.$$

Table 8.3 One percent VaR for the next day of the return to the SMI according to the ARMA(0,1)-GARCH(1,1) model

                                                     VaR ($\mathrm{VaR}^{0.01}_{t,t+1}$)
Date          $\mathbb{P}_t X_{t+1}$   $\hat{\sigma}^2_{t,t+1}$   Parametric   Non-parametric
31.12.2001    0.28                     6.61                       5.71         6.30
5.2.2002      -0.109                   6.80                       6.19         6.79
24.7.2003     0.0754                   0.625                      1.77         1.95

Table 8.4 One percent VaR for the next 10 days of the return to the SMI according to the ARMA(0,1)-GARCH(1,1) model

                                          VaR ($\mathrm{VaR}^{0.01}_{t,t+10}$)
Date          $\mathbb{P}_t \tilde{X}_{t+10}$   Parametric   Non-parametric
31.12.2001    0.84                              18.39        22.28
5.2.2002      0.65                              19.41        21.53
24.7.2003     0.78                              6.53         7.70

The computation of $\mathrm{VaR}^{\alpha}_{t,t+1}$ requires determining the $\alpha$-quantile of the distribution of $\nu_t$. This can be done in two ways. The first one uses the assumption about the distribution of $\nu_t$ explicitly. In the simplest case, $\nu_t$ is distributed as a standard normal so that the appropriate quantile can be easily retrieved. The 1 % quantile for the standard normal distribution is -2.33. The second approach is a non-parametric one and uses the empirical distribution function of $\hat{\nu}_t = \hat{Z}_t/\hat{\sigma}_t$ to determine the required quantile. This approach has the advantage that deviations from the standard normal distribution are accounted for. In our case, the 1 % quantile is -2.56 and thus considerably lower than the -2.33 obtained from the normal distribution. Thus the VaR is underestimated by using the assumption of the normal distribution.

The corresponding computations for the SMI based on the estimated ARMA(0,1)-GARCH(1,1) model are reported in Table 8.3. A value of 5.71 for the 31st of December 2001 means that one can be 99 % sure that the return of an investment in the stocks of the SMI will not be lower than -5.71 %. The values for the non-parametric approach are typically higher. The comparison of the VaR for different dates clearly shows how the risk evolves over time.
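The one-day figures can be retraced from the forecasts with a few lines of code. The sketch below is an illustration under two assumptions: the column of Table 8.3 next to the mean forecast is read as the forecast of the conditional variance, and the VaR is reported as a positive loss, i.e. as minus the quantile of the return. Both assumptions are adopted because they reproduce the table entries up to rounding of the inputs; the function name is hypothetical.

```python
import math

def one_day_var(mean_forecast, variance_forecast, quantile):
    """One-day VaR as a positive loss: minus (forecast + quantile * sigma)."""
    return -(mean_forecast + quantile * math.sqrt(variance_forecast))

Q_NORMAL = -2.33     # 1 % quantile of the standard normal (from the text)
Q_EMPIRICAL = -2.56  # 1 % empirical quantile of the standardized residuals (from the text)

# (date, mean forecast, variance forecast) as read from Table 8.3
rows = [
    ("31.12.2001", 0.28, 6.61),
    ("5.2.2002", -0.109, 6.80),
    ("24.7.2003", 0.0754, 0.625),
]

for date, mean_fc, var_fc in rows:
    parametric = one_day_var(mean_fc, var_fc, Q_NORMAL)
    non_parametric = one_day_var(mean_fc, var_fc, Q_EMPIRICAL)
    print(f"{date}: parametric {parametric:.3f}, non-parametric {non_parametric:.3f}")
```

Because the empirical quantile is further out in the tail than the normal one, the non-parametric VaR is larger for every date.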

Due to the nonlinear character of the model, the VaR for more than one day can only be obtained by simulating the one period returns over the corresponding horizon. Starting from a given date, 10,000 realizations of the returns over the next 10 days have been simulated whereby the corresponding values for $\nu_t$ are either drawn from a standard normal distribution (parametric case) or from the empirical distribution function of $\hat{\nu}_t$ (non-parametric case). The results from this exercise are reported in Table 8.4. Obviously, the risk is much higher for a 10 day than for a one day investment. Alternatively, one may use the forecasting equation (8.12) and the corresponding recursion formula (8.13).
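The simulation exercise can be sketched as follows. The code is an illustration under stated assumptions, not the computation behind Table 8.4: the function name is hypothetical, the start values for the last residual and the last conditional variance are made up (the unconditional variance is used), and the standardized residuals for the non-parametric case are replaced by a placeholder sample. Only the coefficient values are taken from the estimated MA(1)-GARCH(1,1) model.

```python
import numpy as np

def simulate_var(h, n_sim, c, theta, alpha0, alpha1, beta,
                 z_last, sigma2_last, alpha=0.01, shocks=None, seed=0):
    """h-period VaR (as a positive loss) of an MA(1)-GARCH(1,1) model by simulation.

    shocks=None draws nu_t from N(0,1) (parametric case); otherwise nu_t is
    resampled with replacement from the given standardized residuals.
    """
    rng = np.random.default_rng(seed)
    z_prev = np.full(n_sim, float(z_last))
    sigma2_prev = np.full(n_sim, float(sigma2_last))
    total_return = np.zeros(n_sim)
    for _ in range(h):
        sigma2 = alpha0 + alpha1 * z_prev ** 2 + beta * sigma2_prev  # variance equation
        if shocks is None:
            nu = rng.standard_normal(n_sim)
        else:
            nu = rng.choice(np.asarray(shocks, dtype=float), size=n_sim)
        z = np.sqrt(sigma2) * nu
        total_return += c + z + theta * z_prev                       # MA(1) mean equation
        z_prev, sigma2_prev = z, sigma2
    return -np.quantile(total_return, alpha)

coef = dict(c=0.0755, theta=0.0484, alpha0=0.0765, alpha1=0.1388, beta=0.8081)
start = dict(z_last=0.0, sigma2_last=0.0765 / (1.0 - 0.1388 - 0.8081))  # assumed start values

var_1 = simulate_var(1, 10_000, **coef, **start)
var_10 = simulate_var(10, 10_000, **coef, **start)
placeholder_residuals = np.random.default_rng(1).standard_normal(3000)  # stand-in for nu_hat
var_10_np = simulate_var(10, 10_000, **coef, **start, shocks=placeholder_residuals)
print(var_1, var_10, var_10_np)
```

With actual standardized residuals in place of the placeholder sample, the last call corresponds to the non-parametric column of the table.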


Part II

The Keynesian macroeconomic theory developed in the 1930s and 1940s, in particular its representation in terms of the IS- and LM-diagram, opened a new area in the application of statistical methods to economics. Based on the path breaking work by Tinbergen (1939) and Klein (1950) this research gave rise to simultaneous equation systems which should capture all relevant aspects of an economy. The goal was to establish an empirically grounded tool which would enable politicians to analyze the consequences of their policies and thereby fine tune the economy to overcome or at least mitigate major business cycle fluctuations. This development was enhanced by the systematic compilation of national accounting data and by the advances in computer sciences.1 These systems were typically built by putting together single equations such as consumption, investment, money demand, export and import, and Phillips-curve equations to an overall model. The Klein-Goldberger model for the United States was a first successful attempt in this direction (Klein and Goldberger 1955). Adelman and Adelman (1959) showed that this type of model, when shocked by disturbances, exhibited cycles with properties similar to those found for the United States economy.

As the models became more and more detailed over time, they could, in the end, well comprise several hundred or even thousands of equations. The climax of this development was the project LINK which linked the different national models to a world model by accounting for their interrelation through trade flows (Klein 1985). Although this research program brought many insights and spurred the development of econometrics as a separate field, by the mid 1970s one had to admit that the idea to use large and very detailed models for forecasting and policy analysis was overly optimistic. In particular, the inability to forecast and cope with the oil crisis of the early 1970s raised doubts about the viability of this research strategy. In addition, more and more economists had concerns about the theoretical foundations of these models.


1 See Epstein (1987) for a historical overview.

The critique had several facets. First, it was argued that the bottom-up strategy of
building  a  system  from  single  equations  is  not  compatible  with  general  equilibrium 
theory  which  stresses  the  interdependence  of  economic  activities.  This  insight  was 
even  reinforced  by  the  advent  of  the  theory  of  rational  expectations.  This  theory 
postulated  that  expectations  should  be  formed  on  the  basis  of  all  available  infor¬ 
mation  and  not  just  by  mechanically  extrapolating  from  the  past.  This  implies  that 
developments  in  every  part  of  the  economy,  in  particular  in  the  realm  of  economic 
policy  making,  should  in  principle  be  taken  into  account  and  shape  the  expectation 
formation.  As  expectations  are  omnipresent  in  almost  every  economic  decision, 
all aspects of economic activities (consumption, capital accumulation, investment,
etc.)  are  inherently  linked.  Thus  the  strategy  of  using  zero  restrictions — which 
meant  that  certain  variables  were  omitted  from  a  particular  equation — to  identify  the 
parameters  in  a  simultaneous  equation  system  was  considered  to  be  flawed.  Second, 
the  theory  of  rational  expectations  implied  that  the  typical  behavioral  equations 
underlying  these  models  are  not  invariant  to  changes  in  policies  because  economic 
agents  would  take  into  account  systematic  changes  in  the  economic  environment 
in  their  decision  making.  This  so-called  Lucas-critique  (Lucas  1976)  undermined 
the  basis  for  the  existence  of  large  simultaneous  equation  models.  Third,  simple 
univariate  ARMA  models  proved  to  be  as  good  in  forecasting  as  the  sophisticated 
large  simultaneous  models.  Thus  it  was  argued  that  the  effort  or  at  least  part  of  the 
effort  devoted  to  these  models  was  wasted. 

In 1980 Sims (1980b) proposed an alternative modeling strategy. This strategy concentrates the modeling activity on only a few core variables, but places no restrictions whatsoever on the dynamic interrelation among them. Thus every
variable  is  considered  to  be  endogenous  and,  in  principle,  dependent  on  all  other 
variables  of  the  model.  In  the  linear  context,  the  class  of  vector  autoregressive 
(VAR)  models  has  proven  to  be  most  convenient  to  capture  this  modeling  strategy. 
They  are  easy  to  implement  and  to  analyze.  In  contrast  to  the  simultaneous  equation 
approach,  however,  it  is  no  longer  possible  to  perform  comparative  static  exercises 
and  to  analyze  the  effect  of  one  variable  on  another  one  because  every  variable 
is  endogenous  a  priori.  Instead,  one  tries  to  identify  and  quantify  the  effect  of 
shocks  over  time.  These  shocks  are  usually  given  some  economic  content,  like 
demand  or  supply  disturbances.  However,  these  shocks  are  not  directly  observed, 
but  are  disguised  behind  the  residuals  from  the  VAR.  Thus,  the  VAR  approach  also 
faces  a  fundamental  identification  problem.  Since  the  seminal  contribution  by  Sims, 
the  literature  has  proposed  several  alternative  identification  schemes  which  will  be 
discussed  in  Chap.  15  under  the  header  of  structural  vector  autoregressive  (SVAR) 
models.  The  effects  of  these  shocks  are  then  further  analyzed  by  computing  impulse 
responses  and  forecast  error  variance  decompositions.2 

The reliance on shocks can be seen as a substitute for the lack of experiments in macroeconomics. The approach can be interpreted as a statistical analogue to the identification of specific episodes where some unforeseen event (shock) impinges on and propagates throughout the economy. Singling out these episodes of quasi “natural experiments” as in Friedman and Schwartz (1963) convinced many economists of the role and effects of monetary policy.

2 Watson (1994) and Kilian (2013) provide a general introduction to this topic.

Many concepts which have been introduced in the univariate context carry over in a straightforward manner to the multivariate context. However, there are some new aspects. First, we will analyze the interaction among several variables. This can be done in a nonparametric way by examining the cross-correlations between time series or by building an explicit model. We will restrict ourselves to the class of VAR models as they are easy to handle and are overwhelmingly used in practice. Second,
we  will  discuss  several  alternative  approaches  to  identify  the  structural  shocks  from 
VAR  models.  After  analyzing  the  identification  problem  in  general,  we  describe 
short-run,  long-run,  and  sign  restrictions  as  possible  remedies.  Third,  we  will 
discuss  the  modeling  of  integrated  variables  in  a  more  systematic  way.  In  particular, 
we  will  extend  the  concept  of  cointegration  to  more  than  two  variables.  Finally, 
we  will  provide  an  introduction  to  the  state  space  models  as  a  general  modeling 
approach.  State  space  models  are  becoming  increasingly  popular  in  economics  as 
they  can  be  more  directly  linked  to  theoretical  economic  models. 


10 Definitions and Stationarity


Similarly  to  the  univariate  case,  we  start  our  exposition  with  the  concept  of 
stationarity  which  is  also  crucial  in  the  multivariate  setting.  Before  doing  so  let  us 
define  the  multivariate  stochastic  process. 

Definition 10.1. A multivariate stochastic process, $\{X_t\}$, is a family of random variables indexed by t, $t \in \mathbb{Z}$, which take values in $\mathbb{R}^n$, $n \geq 1$. n is called the dimension of the process.

Setting n = 1, the above definition includes univariate stochastic processes as a special case. This implies that the statements for multivariate processes carry over analogously to the univariate case. We view $X_t$ as a column vector:

$$X_t = \begin{pmatrix} X_{1t} \\ \vdots \\ X_{nt} \end{pmatrix}$$

Each element $\{X_{it}\}$ thereby represents a particular variable which may be treated as a univariate process. As in the example of Sect. 15.4.5, $\{X_t\}$ represents the multivariate process consisting of the growth rate of GDP $Y_t$, the unemployment rate $U_t$, the inflation rate $P_t$, the wage inflation rate $W_t$, and the growth rate of money $M_t$. Thus, $X_t = (Y_t, U_t, P_t, W_t, M_t)'$.

As in the univariate case, we characterize the joint distribution of the elements $X_{it}$ and $X_{jt}$ by the first two moments (if they exist), i.e. by the mean and the variance, respectively covariance:

$$\mu_{it} = E X_{it}, \quad i = 1, \ldots, n$$

$$\gamma_{ij}(t,s) = E(X_{it} - \mu_{it})(X_{js} - \mu_{js}), \quad i,j = 1, \ldots, n;\ t,s \in \mathbb{Z} \qquad (10.1)$$

It is convenient to write these entities compactly as vectors and matrices:

$$\mu_t = \begin{pmatrix} \mu_{1t} \\ \vdots \\ \mu_{nt} \end{pmatrix} = E X_t = \begin{pmatrix} E X_{1t} \\ \vdots \\ E X_{nt} \end{pmatrix}$$

$$\Gamma(t,s) = \begin{pmatrix} \gamma_{11}(t,s) & \ldots & \gamma_{1n}(t,s) \\ \vdots & \ddots & \vdots \\ \gamma_{n1}(t,s) & \ldots & \gamma_{nn}(t,s) \end{pmatrix} = E(X_t - \mu_t)(X_s - \mu_s)'$$

Thus, we apply the expectations operator element-wise to vectors and matrices. The matrix-valued function $\Gamma(t,s)$ is called the covariance function of $\{X_t\}$.

In analogy to the univariate case, we define stationarity as the invariance of the first two moments to time shifts:

Definition 10.2 (Stationarity). A multivariate stochastic process $\{X_t\}$ is stationary if and only if for all integers r, s and t we have:

(i) $\mu = \mu_t = E X_t$ is constant (independent of t);

(ii) $E X_t' X_t < \infty$;

(iii) $\Gamma(t,s) = \Gamma(t+r, s+r)$.

In the literature these properties are often called weak stationarity, covariance
stationarity, or stationarity of second order. If $\{X_t\}$ is stationary, the covariance
function only depends on the number of periods between $t$ and $s$ (i.e. on $t - s$)
and not on $t$ or $s$ themselves. This implies that by setting $r = -s$ and $h = t - s$ the
covariance function simplifies to

\[
\Gamma(h) = \Gamma(t - s) = \Gamma(t + r, s + r) = \Gamma(t + h, t)
= \mathbb{E}(X_{t+h} - \mu)(X_t - \mu)' = \mathbb{E}(X_t - \mu)(X_{t-h} - \mu)'.
\]

For $h = 0$, $\Gamma(0)$ is the unconditional covariance matrix of $X_t$. Using the definition
of the covariances in Eq. (10.1) we get:

\[
\Gamma(h) = \Gamma(-h)'.
\]

Note that $\Gamma(h)$ is in general not symmetric for $h \neq 0$ because $\gamma_{ij}(h) \neq \gamma_{ji}(h)$
for $h \neq 0$.

Based on the covariance function of a stationary process, we can define the
correlation function $R(h)$ where $R(h) = (\rho_{ij}(h))_{i,j}$ with

\[
\rho_{ij}(h) = \frac{\gamma_{ij}(h)}{\sqrt{\gamma_{ii}(0)\,\gamma_{jj}(0)}}.
\]

In the case $i \neq j$ we refer to the cross-correlations between two variables $\{X_{it}\}$ and
$\{X_{jt}\}$. The correlation function can be written in matrix notation as

\[
R(h) = V^{-1/2}\, \Gamma(h)\, V^{-1/2}
\]

where $V$ represents a diagonal matrix with diagonal elements equal to $\gamma_{ii}(0)$. Clearly,
$\rho_{ii}(0) = 1$. As for the covariance matrix we have that in general $\rho_{ij}(h) \neq \rho_{ji}(h)$ for
$h \neq 0$. It is possible that $\rho_{ij}(h) > \rho_{ij}(0)$. We can summarize the properties of the
covariance function by the following theorem.1

Theorem 10.1. The covariance function of a stationary process $\{X_t\}$ has the
following properties:

(i) for all $h \in \mathbb{Z}$, $\Gamma(h) = \Gamma(-h)'$;

(ii) for all $h \in \mathbb{Z}$, $|\gamma_{ij}(h)| \leq \sqrt{\gamma_{ii}(0) \times \gamma_{jj}(0)}$;

(iii) for each $i = 1, \ldots, n$, $\gamma_{ii}(h)$ is a univariate autocovariance function;

(iv) $\sum_{r,k=1}^{m} a_r' \Gamma(r - k) a_k \geq 0$ for all $m \in \mathbb{N}$ and all $a_1, \ldots, a_m \in \mathbb{R}^n$. This property
is called non-negative definiteness (see Property 4 in Theorem 1.1 of Sect. 1.3
in the univariate case).

Proof. Property (i) follows immediately from the definition. Property (ii) follows
from the fact that the correlation coefficient is always smaller than or equal to one in absolute value. $\gamma_{ii}(h)$
is the autocovariance function of $\{X_{it}\}$ which delivers property (iii). Property (iv)
follows from $\mathbb{E}\left(\sum_{k=1}^{m} a_k'(X_{t-k} - \mu)\right)^2 \geq 0$. □

If  not  only  the  first  two  moments  are  invariant  to  time  shifts,  but  the  distribution 
as  a  whole  we  arrive  at  the  concept  of  strict  stationarity. 

Definition 10.3 (Strict Stationarity). A multivariate process $\{X_t\}$ is called strictly
stationary if and only if for all $n \in \mathbb{N}$, $t_1, \ldots, t_n$, $h \in \mathbb{Z}$, the joint distributions of
$(X_{t_1}, \ldots, X_{t_n})$ and $(X_{t_1+h}, \ldots, X_{t_n+h})$ are the same.


An  Example 

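Consider the following example for $n = 2$:

\[
X_{1t} = Z_t,
\]
\[
X_{2t} = Z_t + 0.75\, Z_{t-2},
\]

where $Z_t \sim \mathrm{WN}(0, 1)$. We then have $\mu = \mathbb{E} X_t = 0$. The covariance function is
given by

\[
\Gamma(h) = \begin{cases}
\begin{pmatrix} 1 & 1 \\ 1 & 1.5625 \end{pmatrix}, & h = 0; \\[2ex]
\begin{pmatrix} 0 & 0 \\ 0 & 0 \end{pmatrix}, & h = 1; \\[2ex]
\begin{pmatrix} 0 & 0 \\ 0.75 & 0.75 \end{pmatrix}, & h = 2.
\end{cases}
\]

The covariance function is zero for $h > 2$. The values for $h < 0$ are obtained from
property (i) in Theorem 10.1. The correlation function is:

\[
R(h) = \begin{cases}
\begin{pmatrix} 1 & 0.8 \\ 0.8 & 1 \end{pmatrix}, & h = 0; \\[2ex]
\begin{pmatrix} 0 & 0 \\ 0 & 0 \end{pmatrix}, & h = 1; \\[2ex]
\begin{pmatrix} 0 & 0 \\ 0.60 & 0.48 \end{pmatrix}, & h = 2.
\end{cases}
\]

The correlation function is zero for $h > 2$. The values for $h < 0$ are obtained from
property (i) in Theorem 10.1.

1 We leave it to the reader to derive an analogous theorem for the correlation function.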

One idea in time series analysis is to construct more complicated processes from
simple ones, for example by taking moving averages. The simplest process is the
white noise process which is uncorrelated with its own past. In the multivariate
context the white noise process is defined as follows.

Definition 10.4. A stochastic process $\{Z_t\}$ is called (multivariate) white noise
process with mean zero and covariance matrix $\Sigma > 0$, denoted by $Z_t \sim \mathrm{WN}(0, \Sigma)$,
if $\{Z_t\}$ is stationary and

\[
\mathbb{E} Z_t = 0,
\]
\[
\Gamma(h) = \begin{cases} \Sigma, & h = 0; \\ 0, & h \neq 0. \end{cases}
\]

If $\{Z_t\}$ is not only white noise, but independently and identically distributed we
write $Z_t \sim \mathrm{IID}(0, \Sigma)$.


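Remark 10.1. Even if each component of $\{Z_t\}$ is univariate white noise, this does
not imply that $\{Z_t\} = \{(Z_{1t}, \ldots, Z_{nt})'\}$ is multivariate white noise. Take, for example,
the process $Z_t = (u_t, u_{t-1})'$ where $u_t \sim \mathrm{WN}(0, \sigma_u^2)$. Then

\[
\Gamma(1) = \mathbb{E} Z_{t+1} Z_t' = \begin{pmatrix} 0 & 0 \\ \sigma_u^2 & 0 \end{pmatrix} \neq 0.
\]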


Taking  moving  averages  of  a  white  noise  process  it  is  possible  to  generate  new 
stationary  processes.  This  leads  to  the  definition  of  a  linear  process. 

Definition 10.5. A stochastic process $\{X_t\}$ is called linear if there exists a representation

\[
X_t = \sum_{j=-\infty}^{\infty} \Psi_j Z_{t-j}
\]

where $Z_t \sim \mathrm{IID}(0, \Sigma)$ and where the sequence $\{\Psi_j\}$ of $n \times n$ matrices is absolutely
summable, i.e. $\sum_{j=-\infty}^{\infty} \|\Psi_j\| < \infty$. If $\Psi_j = 0$ for all $j < 0$, the linear process is also
called an MA($\infty$) process.

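Theorem 10.2. A linear process is stationary with a mean of zero and with
covariance function

\[
\Gamma(h) = \sum_{j=-\infty}^{\infty} \Psi_{j+h}\, \Sigma\, \Psi_j' = \sum_{j=-\infty}^{\infty} \Psi_j\, \Sigma\, \Psi_{j-h}', \qquad h = 0, \pm 1, \pm 2, \ldots
\]

Proof. The required result is obtained by applying the properties of $\{Z_t\}$ to

\[
\Gamma(h) = \mathbb{E} X_{t+h} X_t' = \lim_{m \to \infty} \mathbb{E} \left( \sum_{j=-m}^{m} \Psi_j Z_{t+h-j} \right) \left( \sum_{k=-m}^{m} \Psi_k Z_{t-k} \right)'. \qquad □
\]

Remark 10.2. The same conclusion is reached if $\{Z_t\}$ is a white noise process and
not an IID process.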
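The formula of Theorem 10.2 is easy to evaluate numerically when only finitely many coefficient matrices are non-zero. The following sketch is not part of the original exposition; the function name `linear_process_acov` and the embedding of the example of Sect. "An Example" as a two-dimensional process with a zero second column are our own assumptions for illustration.

```python
import numpy as np

def linear_process_acov(psi, sigma, h):
    """Gamma(h) = sum_j Psi_{j+h} Sigma Psi_j' for a one-sided, finite MA.

    psi   : list of (n x n) arrays Psi_0, ..., Psi_q (all other Psi_j are zero)
    sigma : (n x n) covariance matrix of Z_t
    """
    if h < 0:
        # property (i) of Theorem 10.1: Gamma(h) = Gamma(-h)'
        return linear_process_acov(psi, sigma, -h).T
    n = sigma.shape[0]
    gamma = np.zeros((n, n))
    for j in range(len(psi) - h):      # empty range if h > q, so Gamma(h) = 0
        gamma += psi[j + h] @ sigma @ psi[j].T
    return gamma

# Example: X_1t = Z_t, X_2t = Z_t + 0.75 Z_{t-2}.
# Assumption: we write it as a 2-dimensional linear process whose second
# innovation never enters (second column of every Psi_j is zero).
psi = [np.array([[1.0, 0.0], [1.0, 0.0]]),
       np.zeros((2, 2)),
       np.array([[0.0, 0.0], [0.75, 0.0]])]
sigma = np.eye(2)

gamma0 = linear_process_acov(psi, sigma, 0)
gamma2 = linear_process_acov(psi, sigma, 2)

# correlation function R(h) = V^{-1/2} Gamma(h) V^{-1/2}
v_inv_sqrt = np.diag(1.0 / np.sqrt(np.diag(gamma0)))
corr2 = v_inv_sqrt @ gamma2 @ v_inv_sqrt
```

For infinitely many non-zero coefficients the sum would have to be truncated; absolute summability guarantees that the truncation error vanishes as more terms are included.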


Appendix:  Norm  and  Summability  of  Matrices 

As in the definition of a linear process it is often necessary to analyze the
convergence of a sequence of matrices $\{\Psi_j\}$, $j = 0, 1, 2, \ldots$ For this we need to
define a norm for matrices. The literature considers different alternative approaches.
For our purposes, the choice is not relevant as all norms are equivalent in a finite
dimensional vector space. We therefore choose the Frobenius, Hilbert-Schmidt or
Schur norm which is easy to compute.2 This norm treats the elements of an $m \times n$
matrix $A = (a_{ij})$ as an element of the Euclidean space $\mathbb{R}^{m \times n}$ and defines the length
of $A$, denoted by $\|A\|$, as $\|A\| = \sqrt{\sum_{i,j} a_{ij}^2}$. This leads to the formal definition below.

Definition 10.6. The Frobenius, Hilbert-Schmidt or Schur norm of an $m \times n$ matrix
$A$, denoted by $\|A\|$, is defined as:

\[
\|A\|^2 = \sum_{i,j} a_{ij}^2 = \operatorname{tr}(A'A) = \sum_{i=1}^{n} \lambda_i
\]

where $\operatorname{tr}(A'A)$ denotes the trace of $A'A$, i.e. the sum of the diagonal elements of $A'A$,
and where $\lambda_i$ are the $n$ eigenvalues of $A'A$.


2 For details see Meyer (2000, 279ff).

The matrix norm has the following properties:

$\|A\| \geq 0$ and $\|A\| = 0$ is equivalent to $A = 0$,

$\|\alpha A\| = |\alpha| \, \|A\|$ for all $\alpha \in \mathbb{R}$,

$\|A\| = \|A'\|$,

$\|A + B\| \leq \|A\| + \|B\|$ for all matrices $A$ and $B$ of the same dimension,

$\|AB\| \leq \|A\| \, \|B\|$ for all conformable matrices $A$ and $B$.

The last property is called submultiplicativity.
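The three equivalent expressions in Definition 10.6 and the listed properties can be checked numerically. The following is a small sketch with arbitrarily drawn matrices; the seed and the dimensions are assumptions chosen only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)          # arbitrary seed (assumption)
A = rng.normal(size=(3, 2))
B = rng.normal(size=(2, 4))
C = rng.normal(size=(3, 2))

# Frobenius norm computed in four equivalent ways
fro = np.linalg.norm(A, "fro")
via_entries = np.sqrt((A ** 2).sum())                   # sqrt of sum of a_ij^2
via_trace = np.sqrt(np.trace(A.T @ A))                  # sqrt of tr(A'A)
via_eigs = np.sqrt(np.linalg.eigvalsh(A.T @ A).sum())   # sqrt of sum of eigenvalues of A'A

# triangle inequality and submultiplicativity
triangle_ok = np.linalg.norm(A + C, "fro") <= fro + np.linalg.norm(C, "fro")
submult_ok = np.linalg.norm(A @ B, "fro") <= fro * np.linalg.norm(B, "fro")
```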

A sequence of matrices $\{\Psi_j\}$, $j = 0, 1, \ldots$, is called absolutely summable if and
only if $\sum_{j=0}^{\infty} \|\Psi_j\| < \infty$; the sequence is said to be quadratic summable if and only
if $\sum_{j=0}^{\infty} \|\Psi_j\|^2 < \infty$. The absolute summability implies the quadratic summability,
but not vice versa.

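Corollary 10.1. Absolute summability of $\{\Psi_j\}$ is equivalent to the absolute summability
of each sequence $\{[\Psi_j]_{kl}\}$, $k, l = 1, \ldots, n$, i.e. to $\lim_{J \to \infty} \sum_{j=0}^{J} |[\Psi_j]_{kl}|$ exists and is
finite.

Proof. In particular, we have:

\[
|[\Psi_j]_{kl}| \leq \|\Psi_j\| = \sqrt{\sum_{k=1}^{n} \sum_{l=1}^{n} [\Psi_j]_{kl}^2} \leq \sum_{k=1}^{n} \sum_{l=1}^{n} |[\Psi_j]_{kl}|.
\]

Summation over $j$ gives

\[
\sum_{j=0}^{\infty} |[\Psi_j]_{kl}| \leq \sum_{j=0}^{\infty} \|\Psi_j\| \leq \sum_{j=0}^{\infty} \sum_{k=1}^{n} \sum_{l=1}^{n} |[\Psi_j]_{kl}| = \sum_{k=1}^{n} \sum_{l=1}^{n} \sum_{j=0}^{\infty} |[\Psi_j]_{kl}|.
\]

The absolute convergence of each sequence $\{[\Psi_j]_{kl}\}$, $k, l = 1, \ldots, n$, follows from
the absolute convergence of $\{\Psi_j\}$ by the first inequality. Conversely, the absolute
convergence of $\{\Psi_j\}$ is implied by the absolute convergence of each sequence
$\{[\Psi_j]_{kl}\}$, $k, l = 1, \ldots, n$, from the second inequality. □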


Estimation of Mean and Covariance Function


11.1 Estimators and Their Asymptotic Distributions


We characterize the stationary process $\{X_t\}$ by its mean and its (matrix) covariance
function. In the Gaussian case, this already characterizes the whole distribution. The
estimation of these entities becomes crucial in the empirical analysis. As it turns out,
the results from the univariate process carry over analogously to the multivariate
case. If the process is observed over the periods $t = 1, 2, \ldots, T$, then a natural
estimator for the mean $\mu$ is the arithmetic mean or sample average:

\[
\hat{\mu} = \bar{X}_T = \frac{1}{T} (X_1 + \ldots + X_T) = \begin{pmatrix} \bar{X}_1 \\ \vdots \\ \bar{X}_n \end{pmatrix}.
\]

We get a theorem analogous to Theorem 4.1 in Sect. 4.1.

Theorem 11.1. Let $\{X_t\}$ be a stationary process with mean $\mu$ and covariance function
$\Gamma(h)$. Then asymptotically, for $T \to \infty$, we get

\[
\mathbb{E} (\bar{X}_T - \mu)' (\bar{X}_T - \mu) \to 0, \quad \text{if } \gamma_{ii}(T) \to 0 \text{ for all } 1 \leq i \leq n;
\]
\[
T\, \mathbb{E} (\bar{X}_T - \mu)' (\bar{X}_T - \mu) \to \sum_{i=1}^{n} \sum_{h=-\infty}^{\infty} \gamma_{ii}(h), \quad \text{if } \sum_{h=-\infty}^{\infty} |\gamma_{ii}(h)| < \infty \text{ for all } 1 \leq i \leq n.
\]

Proof. The Theorem can be established by applying Theorem 4.1 individually to
each time series $\{X_{it}\}$, $i = 1, 2, \ldots, n$. □


Thus,  the  sample  average  converges  in  mean  square  and  therefore  also  in 
probability  to  the  true  mean.  Thereby  the  second  condition  is  more  restrictive 
than  the  first  one.  They  are,  in  particular,  fulfilled  for  all  VARMA  processes  (see 
Chap. 12). As in the univariate case analyzed in Sect. 4.1, it can be shown with some
mild additional assumptions that $\bar{X}_T$ is also asymptotically normally distributed.

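Theorem 11.2. For any stationary process $\{X_t\}$

\[
X_t = \mu + \sum_{j=-\infty}^{\infty} \Psi_j Z_{t-j}
\]

with $Z_t \sim \mathrm{IID}(0, \Sigma)$ and $\sum_{j=-\infty}^{\infty} \|\Psi_j\| < \infty$, the arithmetic average $\bar{X}_T$ is
asymptotically normal:

\[
\sqrt{T} \left( \bar{X}_T - \mu \right) \xrightarrow{\;d\;} N\left( 0, \sum_{h=-\infty}^{\infty} \Gamma(h) \right)
= N\left( 0, \left( \sum_{j=-\infty}^{\infty} \Psi_j \right) \Sigma \left( \sum_{j=-\infty}^{\infty} \Psi_j \right)' \right)
= N\left( 0, \Psi(1)\, \Sigma\, \Psi(1)' \right).
\]

Proof. The proof is a straightforward extension to the multivariate case of the one
given for Theorem 4.2 of Sect. 4.1. □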


The summability condition is quite general. It is, in particular, fulfilled by causal
VARMA processes (see Chap. 12) as their coefficient matrices $\Psi_j$ go exponentially
fast to zero. Remarks similar to those following Theorem 4.2 apply also in the
multivariate case.

The above formula can be used to construct confidence regions for $\mu$. This
turns out, however, to be relatively complicated in practice so that often univariate
approximations are used instead (Brockwell and Davis 1996, 228-229).

As in the univariate case, a natural estimator for the covariance matrix function
$\Gamma(h)$ is given by the corresponding empirical moments $\hat{\Gamma}(h)$:

\[
\hat{\Gamma}(h) = \begin{cases}
\frac{1}{T} \sum_{t=1}^{T-h} \left( X_{t+h} - \bar{X}_T \right) \left( X_t - \bar{X}_T \right)', & 0 \leq h \leq T - 1; \\[1ex]
\hat{\Gamma}'(-h), & -T + 1 \leq h < 0.
\end{cases}
\]

The estimator of the covariance function can then be applied to derive an estimator
for the correlation function:

\[
\hat{R}(h) = \hat{V}^{-1/2}\, \hat{\Gamma}(h)\, \hat{V}^{-1/2}
\]

where $\hat{V}^{1/2} = \operatorname{diag}\left( \sqrt{\hat{\gamma}_{11}(0)}, \ldots, \sqrt{\hat{\gamma}_{nn}(0)} \right)$. Under the conditions given in
Theorem 11.2 the estimator of the covariance matrix of order $h$, $\hat{\Gamma}(h)$, converges
to the true covariance matrix $\Gamma(h)$. Moreover, $\sqrt{T} \left( \hat{\Gamma}(h) - \Gamma(h) \right)$ is asymptotically
normally distributed. In particular, we can state the following Theorem:

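Theorem 11.3. Let $\{X_t\}$ be a stationary process with

\[
X_t = \mu + \sum_{j=-\infty}^{\infty} \Psi_j Z_{t-j}
\]

where $Z_t \sim \mathrm{IID}(0, \Sigma)$, $\sum_{j=-\infty}^{\infty} \|\Psi_j\| < \infty$, and $\sum_{j=-\infty}^{\infty} \Psi_j \neq 0$. Then, for each
fixed $h$, $\hat{\Gamma}(h)$ converges in probability as $T \to \infty$ to $\Gamma(h)$:

\[
\hat{\Gamma}(h) \xrightarrow{\;p\;} \Gamma(h).
\]

Proof. A proof can be given along the lines given in Proposition 13.1. □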
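The estimators $\hat{\Gamma}(h)$ and $\hat{R}(h)$ translate almost literally into code. The following sketch assumes that the observations are stored as a $T \times n$ array with one row per period; the function names `sample_acov` and `sample_acorr` are hypothetical and chosen only for this illustration.

```python
import numpy as np

def sample_acov(x, h):
    """Gamma_hat(h) = (1/T) sum_{t=1}^{T-h} (X_{t+h} - Xbar)(X_t - Xbar)'.

    x : (T x n) array, row t holds the observation X_t'.
    """
    if h < 0:
        # Gamma_hat(h) = Gamma_hat(-h)' for negative h
        return sample_acov(x, -h).T
    T = x.shape[0]
    xc = x - x.mean(axis=0)              # subtract the sample average
    return xc[h:].T @ xc[:T - h] / T     # note the division by T, not T - h

def sample_acorr(x, h):
    """R_hat(h) = V_hat^{-1/2} Gamma_hat(h) V_hat^{-1/2}."""
    d = 1.0 / np.sqrt(np.diag(sample_acov(x, 0)))
    return d[:, None] * sample_acov(x, h) * d[None, :]

# usage on artificial data (seed and dimensions are arbitrary assumptions)
rng = np.random.default_rng(1)
data = rng.normal(size=(200, 3))
gamma_hat_1 = sample_acov(data, 1)
r_hat_1 = sample_acorr(data, 1)
```

Dividing by $T$ rather than by $T - h$ follows the definition above.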

As for the univariate case, we can define the long-run covariance matrix $J$ as

\[
J = \sum_{h=-\infty}^{\infty} \Gamma(h). \tag{11.1}
\]

As a non-parametric estimator we can again consider the following class of
estimators:

\[
\hat{J}_T = \sum_{h=-T+1}^{T-1} k\left( \frac{h}{\ell_T} \right) \hat{\Gamma}(h)
\]

where $k(x)$ is a kernel function, $\ell_T$ is the lag truncation parameter, and $\hat{\Gamma}(h)$ is the corresponding estimate
of the covariance matrix at lag $h$. For the choice of the kernel function and the
lag truncation parameter the same principles apply as in the univariate case (see
Sect. 4.4 and Haan and Levin (1997)).
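A minimal sketch of such a kernel estimator is given below. The Bartlett kernel is used purely as an illustrative choice, and the names `long_run_cov`, `bartlett` and the argument `bandwidth` (playing the role of the lag truncation parameter) are assumptions of this sketch, not notation from the text.

```python
import numpy as np

def sample_acov(x, h):
    """Gamma_hat(h) for a (T x n) data array x (division by T)."""
    if h < 0:
        return sample_acov(x, -h).T
    T = x.shape[0]
    xc = x - x.mean(axis=0)
    return xc[h:].T @ xc[:T - h] / T

def bartlett(u):
    """Bartlett kernel k(u) = 1 - |u| for |u| <= 1 and zero otherwise."""
    return max(0.0, 1.0 - abs(u))

def long_run_cov(x, bandwidth, kernel=bartlett):
    """J_hat = sum_{h=-T+1}^{T-1} k(h / bandwidth) Gamma_hat(h)."""
    T, n = x.shape
    j_hat = np.zeros((n, n))
    for h in range(-T + 1, T):
        w = kernel(h / bandwidth)
        if w != 0.0:                      # skip lags with zero weight
            j_hat += w * sample_acov(x, h)
    return j_hat

# usage on artificial data (seed, sample size and bandwidth are arbitrary)
rng = np.random.default_rng(2)
data = rng.normal(size=(300, 2))
j_hat = long_run_cov(data, bandwidth=5)
```

Because the weights are symmetric in $h$ and $\hat{\Gamma}(-h) = \hat{\Gamma}(h)'$, the resulting estimate is a symmetric matrix.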


11.2 Testing Cross-Correlations of Bivariate Time Series

The determination of the asymptotic distribution of $\hat{\Gamma}(h)$ is complicated. We
therefore restrict ourselves to the case of two time series.

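Theorem 11.4. Let $\{X_t\}$ be a bivariate stochastic process whose components can
be described by

\[
X_{1t} = \sum_{j=-\infty}^{\infty} \alpha_j Z_{1,t-j} \quad \text{with } Z_{1t} \sim \mathrm{IID}(0, \sigma_1^2)
\]
\[
X_{2t} = \sum_{j=-\infty}^{\infty} \beta_j Z_{2,t-j} \quad \text{with } Z_{2t} \sim \mathrm{IID}(0, \sigma_2^2)
\]

where $\{Z_{1t}\}$ and $\{Z_{2t}\}$ are independent from each other at all leads and lags and
where $\sum_j |\alpha_j| < \infty$ and $\sum_j |\beta_j| < \infty$. Under these conditions the asymptotic
distribution of the estimator of the cross-correlation function $\rho_{12}(h)$ between $\{X_{1t}\}$
and $\{X_{2t}\}$ is

\[
\sqrt{T}\, \hat{\rho}_{12}(h) \xrightarrow{\;d\;} N\left( 0, \sum_{j=-\infty}^{\infty} \rho_{11}(j)\, \rho_{22}(j) \right), \qquad h \geq 0. \tag{11.2}
\]

For all $h$ and $k$ with $h \neq k$, $\left( \sqrt{T} \hat{\rho}_{12}(h), \sqrt{T} \hat{\rho}_{12}(k) \right)'$ converges in distribution to a
bivariate normal distribution with mean zero and variances and covariances given
by $\sum_{j=-\infty}^{\infty} \rho_{11}(j) \rho_{22}(j)$ and $\sum_{j=-\infty}^{\infty} \rho_{11}(j) \rho_{22}(j + k - h)$, respectively.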

This result can be used to construct a test of independence, respectively uncorrelatedness,
between two time series. The above theorem, however, shows that the
asymptotic distribution of $\sqrt{T} \hat{\rho}_{12}(h)$ depends on $\rho_{11}(h)$ and $\rho_{22}(h)$ and is therefore
unknown. Thus, the test cannot be based on the cross-correlation alone.1

This  problem  can,  however,  be  overcome  by  implementing  the  following  two- 
step  procedure  suggested  by  Haugh  (1976). 

First step: Estimate for each time series separately a univariate invertible
ARMA model and compute the resulting residuals $\hat{Z}_{it}$,
$i = 1, 2$. If the ARMA models correspond to the true ones, these residuals should
approximately be white noise. This first step is called pre-whitening.

Second step: Under the null hypothesis the two time series $\{X_{1t}\}$ and $\{X_{2t}\}$
are uncorrelated with each other. This implies that the residuals $\{Z_{1t}\}$ and
$\{Z_{2t}\}$ should also be uncorrelated with each other. The variance of the cross-correlations
between $\{Z_{1t}\}$ and $\{Z_{2t}\}$ is therefore asymptotically equal to $1/T$
under the null hypothesis. Thus, one can apply the result of Theorem 11.4
to construct confidence intervals based on formula (11.2). A 95 % confidence
interval is therefore given by $\pm 1.96\, T^{-1/2}$. The Theorem may also be used to
construct a test whether the two series are uncorrelated.

If one is not interested in modeling the two time series explicitly, the simplest way
is to estimate a high order AR model in the first step. Thereby, the order should
be chosen high enough to obtain white noise residuals in the first step. Instead
of looking at each cross-correlation separately, one may also test the joint null
hypothesis that all cross-correlations are simultaneously equal to zero. Such a test
can be based on $T$ times the sum of the squared cross-correlation coefficients. This
statistic is distributed as a $\chi^2$ with $L$ degrees of freedom where $L$ is the number of
summands (see the Haugh-Pierce statistic (15.1) in Sect. 15.1).

1 The theorem may also be used to conduct a causality test between two time series (see Sect. 15.1).


11.3 Some Examples for Independence Tests

Two Independent AR Processes

Consider two AR(1) processes $\{X_{1t}\}$ and $\{X_{2t}\}$ governed by the following stochastic
difference equation $X_{it} = 0.8 X_{i,t-1} + Z_{it}$, $i = 1, 2$. The two white noise processes
$\{Z_{1t}\}$ and $\{Z_{2t}\}$ are such that they are independent from each other. $\{X_{1t}\}$ and
$\{X_{2t}\}$ are therefore independent from each other too. We simulate realizations of
these two processes over 400 periods. The estimated cross-correlation function
of these so-generated processes is plotted in the upper panel of Fig. 11.1. From
there one can see that many values are outside the 95 % confidence interval given
by $\pm 1.96\, T^{-1/2} = \pm 0.098$, despite the fact that by construction both series are
independent of each other. The reason is that the so computed confidence interval
is not correct because it does not take the autocorrelation of each series into
account. The application of Theorem 11.4 leads to the much larger 95 % confidence
interval of

\[
\pm 1.96\, T^{-1/2} \sqrt{\sum_{j=-\infty}^{\infty} \rho_{11}(j) \rho_{22}(j)}
= \pm \frac{1.96}{20} \sqrt{\sum_{j=-\infty}^{\infty} 0.64^{|j|}}
= \pm \frac{1.96}{20} \sqrt{\frac{2}{1 - 0.64} - 1} \approx \pm 0.209
\]

which is more than twice as large. This confidence interval then encompasses most of
the cross-correlations computed with respect to the original series.

If one follows the testing procedure outlined above instead and fits an AR(10)
model for each process and then estimates the cross-correlation function of the
corresponding residual series (filtered or pre-whitened time series), the plot in
the lower panel of Fig. 11.1 is obtained.2 This figure shows no significant cross-correlation
anymore so that one cannot reject the null hypothesis that both time
series are independent from each other.
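The experiment can be reproduced along the following lines. This is only a sketch under several assumptions of our own: Gaussian innovations, an arbitrary seed, a burn-in of 200 periods, and AR models fitted by ordinary least squares; the helper names are hypothetical. The realizations, and hence the plotted cross-correlations, will differ from those underlying Fig. 11.1.

```python
import numpy as np

def simulate_ar1(phi, T, rng, burn=200):
    """Simulate X_t = phi X_{t-1} + Z_t with Gaussian white noise (assumption)."""
    z = rng.normal(size=T + burn)
    x = np.zeros(T + burn)
    for t in range(1, T + burn):
        x[t] = phi * x[t - 1] + z[t]
    return x[burn:]                        # discard the burn-in period

def cross_corr(x, y, h):
    """Sample cross-correlation between x_{t+h} and y_t."""
    T = len(x)
    xc, yc = x - x.mean(), y - y.mean()
    if h >= 0:
        c = np.sum(xc[h:] * yc[:T - h]) / T
    else:
        c = np.sum(xc[:T + h] * yc[-h:]) / T
    return c / np.sqrt(np.mean(xc ** 2) * np.mean(yc ** 2))

def ar_residuals(x, p):
    """Residuals of an AR(p) model with intercept, fitted by OLS."""
    T = len(x)
    regressors = [np.ones(T - p)] + [x[p - k:T - k] for k in range(1, p + 1)]
    X = np.column_stack(regressors)
    y = x[p:]
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return y - X @ beta

rng = np.random.default_rng(42)            # arbitrary seed
T, phi = 400, 0.8
x1, x2 = simulate_ar1(phi, T, rng), simulate_ar1(phi, T, rng)
lags = range(-20, 21)

# raw cross-correlations and the two competing 95 % bands
ccf_raw = np.array([cross_corr(x1, x2, h) for h in lags])
band_naive = 1.96 / np.sqrt(T)
var_thm = (1 + phi ** 2) / (1 - phi ** 2)   # sum over j of phi^(2|j|)
band_thm = 1.96 * np.sqrt(var_thm / T)

# pre-whitening with AR(10) models, then cross-correlations of the residuals
e1, e2 = ar_residuals(x1, 10), ar_residuals(x2, 10)
ccf_res = np.array([cross_corr(e1, e2, h) for h in lags])
band_res = 1.96 / np.sqrt(len(e1))
share_raw = np.mean(np.abs(ccf_raw) > band_naive)
share_res = np.mean(np.abs(ccf_res) > band_res)
```

Comparing `share_raw` with `share_res` shows how often the naive band is violated before and after pre-whitening.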


2 The order of the AR processes is set arbitrarily equal to 10 which is more than enough to obtain
white noise residuals.

Fig. 11.1 Cross-correlations between two independent AR(1) processes with $\phi = 0.8$


Consumption  Expenditure  and  Advertisement  Expenses 

This  application  focuses  on  the  interaction  between  nominal  aggregate  private 
consumption  expenditure  and  nominal  aggregate  advertisement  expenditures.  Such 
an  investigation  was  first  conducted  by  Ashley  et  al.  (1980)  for  the  United  States.3 
The  upper  panel  of  Fig.  11.2  shows  the  raw  cross-correlations  between  the  two 
time series where the order $h$ runs from $-20$ to $+20$. Although almost all cross-correlations
are positive and outside the conventional confidence interval, it would
be misleading to infer a statistically significant positive cross-correlation. In order
to test for independence, we filter both time series by an AR(10) model and estimate
the cross-correlations for the residuals.4 These are displayed in the lower panel of
Fig. 11.2. In this figure, only the correlations of order 0 and 16 fall outside the
confidence interval and can thus be considered as statistically significant. Thus, we
can reject the null hypothesis of independence between the two series. However,
most of the interdependence seems to come from the correlation within the same
quarter. This is confirmed by a more detailed investigation in Berndt (1991) where
no significant lead and/or lag relations are found.

3 The quarterly data are taken from Berndt (1991). They cover the period from the first quarter 1956
to the fourth quarter 1975. In order to achieve stationarity, we work with first differences.

4 The order of the AR processes is set arbitrarily equal to 10 which is more than enough to obtain
white noise residuals.

Fig. 11.2 Cross-correlations between aggregate nominal private consumption expenditures and
aggregate nominal advertisement expenditures

Real  Gross  Domestic  Product  and  Consumer  Sentiment 

The  procedure  outlined  above  can  be  used  to  examine  whether  one  of  the  two  time 
series  is  systematically  leading  the  other  one.  This  is,  for  example,  important  in 
the  judgment  of  the  current  state  of  the  economy  because  first  provisional  national 
accounting  data  are  usually  published  with  a  lag  of  at  least  one  quarter.  However, 
in the conduct of monetary policy more up-to-date knowledge is necessary. Such
knowledge can be retrieved from leading indicator variables. These variables should
be  available  more  quickly  and  should  be  highly  correlated  with  the  variable  of 
interest  at  a  lead. 

We investigate whether the Consumer Sentiment Index is a leading indicator
for the percentage changes in real Gross Domestic Product (GDP).5 The raw
cross-correlations are plotted in the upper panel of Fig. 11.3. It shows several
correlations outside the conventional confidence interval. The use of this confidence
interval is, however, misleading as the distribution of the raw cross-correlations
depends on the autocorrelations of each series. Thus, instead we filter both time
series by an AR(8) model and investigate the cross-correlations of the residuals.6
The order of the AR model was chosen deliberately high to account for all
autocorrelations. The cross-correlations of the filtered data are displayed in the
lower panel of Fig. 11.3. As it turns out, the only cross-correlation which is
significantly different from zero is the one for $h = 1$. Thus the Consumer Sentiment Index
is leading the growth rate in GDP. In other words, an unexpected higher consumer
sentiment is reflected in a positive change in the GDP growth rate of the next quarter.7

5 We use data for Switzerland as published by the State Secretariat for Economic Affairs SECO.

6 With quarterly data it is wise to set the order as a multiple of four to account for possible seasonal
movements. As it turns out $p = 8$ is more than enough to obtain white noise residuals.

7 During the interpretation of the cross-correlations be aware of the ordering of the variables
because $\rho_{12}(1) = \rho_{21}(-1) \neq \rho_{21}(1)$.

Fig. 11.3 Cross-correlations between real growth of GDP and the consumer sentiment index


Stationary  Time  Series  Models:  Vector 
Autoregressive  Moving-Average  Processes 
(VARMA  Processes) 


The  most  important  class  of  models  is  obtained  by  requiring  {X,}  to  be  the  solution 
of  a  linear  stochastic  difference  equation  with  constant  coefficients.  In  analogy  to 
the  univariate  case,  this  leads  to  the  theory  of  vector  autoregressive  moving-average 
processes  (VARMA  processes  or  just  ARMA  processes). 

Definition 12.1 (VARMA process). A multivariate stochastic process $\{X_t\}$ is a vector
autoregressive moving-average process of order $(p, q)$, denoted as VARMA($p, q$)
process, if it is stationary and fulfills the stochastic difference equation

\[
X_t - \Phi_1 X_{t-1} - \ldots - \Phi_p X_{t-p} = Z_t + \Theta_1 Z_{t-1} + \ldots + \Theta_q Z_{t-q} \tag{12.1}
\]

where $\Phi_p \neq 0$, $\Theta_q \neq 0$ and $Z_t \sim \mathrm{WN}(0, \Sigma)$. $\{X_t\}$ is called a VARMA($p, q$) process
with mean $\mu$ if $\{X_t - \mu\}$ is a VARMA($p, q$) process.

With the aid of the lag operator we can write the difference equation more
compactly as

\[
\Phi(L) X_t = \Theta(L) Z_t
\]

where $\Phi(L) = I_n - \Phi_1 L - \ldots - \Phi_p L^p$ and $\Theta(L) = I_n + \Theta_1 L + \ldots + \Theta_q L^q$.
$\Phi(L)$ and $\Theta(L)$ are $n \times n$ matrices whose elements are lag polynomials of order
smaller or equal to $p$, respectively $q$. If $q = 0$, $\Theta(L) = I_n$ so that there is no
moving-average part. The process is then a purely autoregressive one which is
simply called a VAR($p$) process. Similarly if $p = 0$, $\Phi(L) = I_n$ and there is no
autoregressive part. The process is then a purely moving-average one and simply
called a VMA($q$) process. The importance of VARMA processes stems from the
fact that every stationary process can be arbitrarily well approximated by a VARMA
process, VAR process, or VMA process.


12.1  The  VAR(1)  Process 

We start our discussion by analyzing the properties of the VAR(1) process which is
defined as the solution of the following stochastic difference equation:

\[
X_t = \Phi X_{t-1} + Z_t \quad \text{with } Z_t \sim \mathrm{WN}(0, \Sigma).
\]

We assume that all eigenvalues of $\Phi$ are absolutely strictly smaller than one. As
the eigenvalues correspond to the inverses of the roots of the matrix polynomial
$\det(\Phi(z)) = \det(I_n - \Phi z)$, this assumption implies that all roots must lie outside the
unit circle:

\[
\det(I_n - \Phi z) \neq 0 \quad \text{for all } z \in \mathbb{C} \text{ with } |z| \leq 1.
\]

For the sake of exposition, we will further assume that $\Phi$ is diagonalizable, i.e.
there exists an invertible matrix $P$ such that $J = P^{-1} \Phi P$ is a diagonal matrix with
the eigenvalues of $\Phi$ on the diagonal.1
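Consider now the stochastic process

\[
X_t = Z_t + \Phi Z_{t-1} + \Phi^2 Z_{t-2} + \ldots = \sum_{j=0}^{\infty} \Phi^j Z_{t-j}.
\]

We will show that this process is stationary and fulfills the first order difference
equation above. For $\{X_t\}$ to be well-defined, we must show that $\sum_{j=0}^{\infty} \|\Phi^j\| < \infty$.
Using the properties of the matrix norm we get:

\[
\sum_{j=0}^{\infty} \|\Phi^j\| = \sum_{j=0}^{\infty} \|P J^j P^{-1}\| \leq \sum_{j=0}^{\infty} \|P\| \, \|J^j\| \, \|P^{-1}\|
\leq \sum_{j=0}^{\infty} \|P\| \, \|P^{-1}\| \sqrt{\sum_{i=1}^{n} |\lambda_i|^{2j}}
\leq \|P\| \, \|P^{-1}\| \sqrt{n} \sum_{j=0}^{\infty} |\lambda_{\max}|^{j} < \infty,
\]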

where $\lambda_{\max}$ denotes the maximal eigenvalue of $\Phi$ in absolute terms. As all
eigenvalues are required to be strictly smaller than one in absolute value, this clearly also holds for
$\lambda_{\max}$ so that the infinite matrix sum converges. This implies that the process $\{X_t\}$ is
stationary. In addition, we have that

\[
X_t = \sum_{j=0}^{\infty} \Phi^j Z_{t-j} = Z_t + \Phi \sum_{j=0}^{\infty} \Phi^j Z_{t-1-j} = \Phi X_{t-1} + Z_t.
\]

Thus, the process $\{X_t\}$ also fulfills the difference equation.

1 The following exposition remains valid even if $\Phi$ is not diagonalizable. In this case one has to
rely on the Jordan form which complicates the computations (Meyer 2000).

Next we demonstrate that this process is also the unique stationary solution to
the difference equation. Suppose that there exists another stationary process $\{Y_t\}$
which also fulfills the difference equation. By successively iterating the difference
equation one obtains:

\[
Y_t = Z_t + \Phi Z_{t-1} + \Phi^2 Y_{t-2}
= \ldots
= Z_t + \Phi Z_{t-1} + \Phi^2 Z_{t-2} + \ldots + \Phi^k Z_{t-k} + \Phi^{k+1} Y_{t-k-1}.
\]

Because $\{Y_t\}$ is assumed to be stationary, $\mathbb{V} Y_t = \mathbb{V} Y_{t-k-1} = \Gamma(0)$ so that

\[
\mathbb{V} \left( Y_t - \sum_{j=0}^{k} \Phi^j Z_{t-j} \right) = \Phi^{k+1}\, \mathbb{V}(Y_{t-k-1})\, \Phi'^{\,k+1} = \Phi^{k+1}\, \Gamma(0)\, \Phi'^{\,k+1}.
\]

The submultiplicativity of the norm then implies:

\[
\left\| \Phi^{k+1} \Gamma(0) \Phi'^{\,k+1} \right\| \leq \left\| \Phi^{k+1} \right\|^2 \|\Gamma(0)\| \leq \|P\|^2 \|P^{-1}\|^2 \|\Gamma(0)\| \left( \sum_{i=1}^{n} |\lambda_i|^{2(k+1)} \right).
\]

As all eigenvalues of $\Phi$ are absolutely strictly smaller than one, the right hand side
of the above expression converges to zero for $k$ going to infinity. This implies that
$Y_t$ and $X_t = \sum_{j=0}^{\infty} \Phi^j Z_{t-j}$ are equal in the mean square sense and thus also in
probability.

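Based on Theorem 10.2, the mean and the covariance function of the VAR(1)
process are:

\[
\mathbb{E} X_t = \sum_{j=0}^{\infty} \Phi^j\, \mathbb{E} Z_{t-j} = 0,
\]
\[
\Gamma(h) = \sum_{j=0}^{\infty} \Phi^{j+h}\, \Sigma\, \Phi'^{\,j} = \Phi^h \sum_{j=0}^{\infty} \Phi^j\, \Sigma\, \Phi'^{\,j} = \Phi^h\, \Gamma(0), \qquad h \geq 0.
\]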
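The covariance function of a VAR(1) process can be computed directly from these formulas by truncating the infinite sum once the terms become negligible. The sketch below is a numerical illustration only; the coefficient matrix, the covariance matrix of the innovations, the tolerance and the function name `var1_gamma0` are all assumptions made for the example.

```python
import numpy as np

def var1_gamma0(phi, sigma, tol=1e-14, max_terms=10_000):
    """Gamma(0) = sum_{j>=0} Phi^j Sigma Phi'^j, truncated when terms are tiny."""
    if np.max(np.abs(np.linalg.eigvals(phi))) >= 1:
        raise ValueError("all eigenvalues of Phi must be smaller than one in absolute value")
    gamma0 = np.zeros_like(sigma, dtype=float)
    term = sigma.astype(float)            # j = 0 term
    for _ in range(max_terms):
        gamma0 += term
        term = phi @ term @ phi.T         # next term Phi^{j+1} Sigma Phi'^{j+1}
        if np.linalg.norm(term) < tol:
            break
    return gamma0

# hypothetical coefficients: eigenvalues of phi are 0.6 and 0.3
phi = np.array([[0.5, 0.2],
                [0.1, 0.4]])
sigma = np.array([[1.0, 0.3],
                  [0.3, 2.0]])

gamma0 = var1_gamma0(phi, sigma)
gamma1 = phi @ gamma0                     # Gamma(1) = Phi Gamma(0)
gamma2 = np.linalg.matrix_power(phi, 2) @ gamma0
```

As a plausibility check, $\Gamma(0)$ must satisfy $\Gamma(0) = \Phi \Gamma(0) \Phi' + \Sigma$, which follows from taking the variance on both sides of the difference equation.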

Analogously to the univariate case, it can be shown that there still exists a unique
stationary solution if all eigenvalues are absolutely strictly greater than one. This
solution is, however, no longer causal with respect to $\{Z_t\}$. If some of the eigenvalues
of $\Phi$ are on the unit circle, there exists no stationary solution.

12.2 Representation in Companion Form

A VAR($p$) process of dimension $n$ can be represented as a VAR(1) process of dimension
$p \times n$. For this purpose we define the $pn$ vector $Y_t = (X_t', X_{t-1}', \ldots, X_{t-p+1}')'$.

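This new process $\{Y_t\}$ is characterized by the following first order stochastic
difference equation:

\[
Y_t = \begin{pmatrix} X_t \\ X_{t-1} \\ X_{t-2} \\ \vdots \\ X_{t-p+1} \end{pmatrix}
= \begin{pmatrix}
\Phi_1 & \Phi_2 & \ldots & \Phi_{p-1} & \Phi_p \\
I_n & 0 & \ldots & 0 & 0 \\
0 & I_n & \ldots & 0 & 0 \\
\vdots & \vdots & \ddots & \vdots & \vdots \\
0 & 0 & \ldots & I_n & 0
\end{pmatrix}
\begin{pmatrix} X_{t-1} \\ X_{t-2} \\ X_{t-3} \\ \vdots \\ X_{t-p} \end{pmatrix}
+ \begin{pmatrix} Z_t \\ 0 \\ 0 \\ \vdots \\ 0 \end{pmatrix}
= \boldsymbol{\Phi} Y_{t-1} + U_t
\]

where $U_t = (Z_t', 0, 0, \ldots, 0)'$ with $U_t \sim \mathrm{WN}\left( 0, \begin{pmatrix} \Sigma & 0 \\ 0 & 0 \end{pmatrix} \right)$. This representation is
also known as the companion form or state space representation (see also Chap. 17).
In this representation the last $n(p - 1)$ equations are simply identities so that there is
no error term attached. The latter name stems from the fact that $Y_t$ encompasses all
the information necessary to describe the state of the system. The matrix $\boldsymbol{\Phi}$ is called
the companion matrix of the VAR($p$) process.2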

The main advantage of the companion form is that by studying the properties
of the VAR(1) model, one implicitly encompasses VAR models of higher order and
also univariate AR($p$) models which can be considered as special cases. The relation
between the eigenvalues of the companion matrix and the roots of the polynomial
matrix $\Phi(z)$ is given by the formula (Gohberg et al. 1982):

\[
\det\left( I_{np} - \boldsymbol{\Phi} z \right) = \det\left( I_n - \Phi_1 z - \ldots - \Phi_p z^p \right). \tag{12.2}
\]

In the case of the AR($p$) process the eigenvalues of $\boldsymbol{\Phi}$ are just the inverses of the
roots of the polynomial $\Phi(z)$. Further elaboration of state space models is given in
Chap. 17.
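The construction of the companion matrix and formula (12.2) can be illustrated numerically. In the following sketch the function name `companion`, the coefficient matrices of the VAR(2) example and the evaluation point $z$ are hypothetical choices made only for illustration.

```python
import numpy as np

def companion(phis):
    """Companion matrix of a VAR(p) with coefficient matrices Phi_1, ..., Phi_p."""
    p = len(phis)
    n = phis[0].shape[0]
    top = np.hstack(phis)                                   # (Phi_1 ... Phi_p)
    bottom = np.hstack([np.eye(n * (p - 1)),                # identities
                        np.zeros((n * (p - 1), n))])        # last block column is zero
    return np.vstack([top, bottom])

# hypothetical VAR(2) coefficients, n = 2
phi1 = np.array([[0.5, 0.1],
                 [0.0, 0.3]])
phi2 = np.array([[0.2, 0.0],
                 [0.1, 0.1]])
big_phi = companion([phi1, phi2])

# check formula (12.2) at an arbitrary complex point z
z = 0.3 + 0.4j
lhs = np.linalg.det(np.eye(4) - big_phi * z)
rhs = np.linalg.det(np.eye(2) - phi1 * z - phi2 * z ** 2)

# all eigenvalues inside the unit circle <=> all roots of det Phi(z) outside
max_modulus = np.max(np.abs(np.linalg.eigvals(big_phi)))
```

In applied work the modulus of the largest eigenvalue of the companion matrix, here `max_modulus`, is a convenient check of the root condition because it avoids computing the roots of $\det \Phi(z)$ explicitly.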


12.3  Causal  Representation 

As will become clear in Chap. 15 and particularly in Sect. 15.2, the issue of the
existence of a causal representation is even more important than in the univariate
case. Before stating the main theorem let us generalize the definition of a causal
representation from the univariate case (see Definition 2.2 in Sect. 2.3) to the
multivariate one.

2 The representation of a VAR($p$) process in companion form is not uniquely defined. Permutations
of the elements in $Y_t$ will lead to changes in the companion matrix.

Definition 12.2. A VARMA(p,q) process $\{X_t\}$ with $\Phi(L)X_t = \Theta(L)Z_t$ is called
causal with respect to $\{Z_t\}$ if and only if there exists a sequence of absolutely
summable matrices $\{\Psi_j\}$, $j = 0, 1, 2, \ldots$, i.e. $\sum_{j=0}^{\infty} \|\Psi_j\| < \infty$, such that

\[ X_t = \sum_{j=0}^{\infty} \Psi_j Z_{t-j}. \]

Theorem 12.1. Let $\{X_t\}$ be a VARMA(p,q) process with $\Phi(L)X_t = \Theta(L)Z_t$ and
assume that

\[ \det \Phi(z) \neq 0 \quad \text{for all } z \in \mathbb{C} \text{ with } |z| \le 1, \]

then the stochastic difference equation $\Phi(L)X_t = \Theta(L)Z_t$ has exactly one stationary
solution with causal representation

\[ X_t = \sum_{j=0}^{\infty} \Psi_j Z_{t-j}, \]

whereby the sequence of matrices $\{\Psi_j\}$ is absolutely summable and where the
matrices are uniquely determined by the identity

\[ \Phi(z)\Psi(z) = \Theta(z). \]

Proof.  The  proof  is  a  straightforward  extension  of  the  univariate  case.  □ 

As in the univariate case, the coefficient matrices which make up the causal
representation can be found by the method of undetermined coefficients, i.e. by
equating $\Phi(z)\Psi(z) = \Theta(z)$. In the case of the VAR(1) process, the $\{\Psi_j\}$ have to
obey the following recursion:

\begin{align*}
z^0 &: \quad \Psi_0 = I_n \\
z^1 &: \quad \Psi_1 = \Phi\Psi_0 = \Phi \\
z^2 &: \quad \Psi_2 = \Phi\Psi_1 = \Phi^2 \\
    &\;\;\vdots \\
z^j &: \quad \Psi_j = \Phi\Psi_{j-1} = \Phi^j
\end{align*}


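The recursion in the VAR(2) case is:

\begin{align*}
z^0 &: \quad \Psi_0 = I_n \\
z^1 &: \quad -\Phi_1 + \Psi_1 = 0 && \Rightarrow \quad \Psi_1 = \Phi_1 \\
z^2 &: \quad -\Phi_2 - \Phi_1\Psi_1 + \Psi_2 = 0 && \Rightarrow \quad \Psi_2 = \Phi_2 + \Phi_1^2 \\
z^3 &: \quad -\Phi_1\Psi_2 - \Phi_2\Psi_1 + \Psi_3 = 0 && \Rightarrow \quad \Psi_3 = \Phi_1^3 + \Phi_1\Phi_2 + \Phi_2\Phi_1
\end{align*}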

Remark 12.1. Consider a VAR(1) process with $\Phi = \begin{pmatrix} 0 & \phi \\ 0 & 0 \end{pmatrix}$ with $\phi \neq 0$. Then
the matrices in the causal representation are $\Psi_j = \Phi^j = 0$ for $j > 1$. This
means that $\{X_t\}$ has an alternative representation as a VMA(1) process because
$X_t = Z_t + \Phi Z_{t-1}$. This simple example demonstrates that the representation of
$\{X_t\}$ as a VARMA process is not unique. It is therefore impossible to always
distinguish between VAR and VMA processes of higher orders without imposing
additional assumptions. These additional assumptions are much more complex in
the multivariate case and are known as identifying assumptions. Thus, a general
treatment of this identification problem is outside the scope of this book. See Hannan
and Deistler (1988) for a general treatment of this issue. For this reason we will
concentrate exclusively on VAR processes where these identification issues do not
arise.


Example 

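We illustrate the above concept by the following VAR(2) model:

\[ X_t = \begin{pmatrix} 0.8 & -0.5 \\ 0.1 & -0.5 \end{pmatrix} X_{t-1} + \begin{pmatrix} -0.3 & -0.3 \\ -0.2 & 0.3 \end{pmatrix} X_{t-2} + Z_t \qquad \text{with } Z_t \sim \mathrm{WN}(0, \Sigma). \]

In a first step, we check whether the VAR model admits a causal representation
with respect to $\{Z_t\}$. For this purpose we have to compute the roots of the equation
$\det(I_2 - \Phi_1 z - \Phi_2 z^2) = 0$:

\[ \det \begin{pmatrix} 1 - 0.8z + 0.3z^2 & 0.5z + 0.3z^2 \\ -0.1z + 0.2z^2 & 1 + 0.5z - 0.3z^2 \end{pmatrix} = 1 - 0.3z - 0.35z^2 + 0.32z^3 - 0.15z^4 = 0. \]

The four roots are: $-1.1973$, $0.8828 \pm 1.6669i$, $1.5650$. As they are all outside
the unit circle, there exists a causal representation which can be found from the
equation $\Phi(z)\Psi(z) = I_2$ by the method of undetermined coefficients. Multiplying
the equation system out, we get:

\begin{align*}
I_2 &- \Phi_1 z - \Phi_2 z^2 \\
    &+ \Psi_1 z - \Phi_1\Psi_1 z^2 - \Phi_2\Psi_1 z^3 \\
    &+ \Psi_2 z^2 - \Phi_1\Psi_2 z^3 - \Phi_2\Psi_2 z^4 \\
    &+ \ldots = I_2.
\end{align*}

Equating the coefficients corresponding to $z^j$, $j = 1, 2, \ldots$:

\begin{align*}
z &: \quad \Psi_1 = \Phi_1 \\
z^2 &: \quad \Psi_2 = \Phi_1\Psi_1 + \Phi_2 \\
z^3 &: \quad \Psi_3 = \Phi_1\Psi_2 + \Phi_2\Psi_1 \\
    &\;\;\vdots \\
z^j &: \quad \Psi_j = \Phi_1\Psi_{j-1} + \Phi_2\Psi_{j-2}.
\end{align*}

The last equation shows how to compute the sequence $\{\Psi_j\}$ recursively:

\[ \Psi_1 = \begin{pmatrix} 0.8 & -0.5 \\ 0.1 & -0.5 \end{pmatrix}, \qquad \Psi_2 = \begin{pmatrix} 0.29 & -0.45 \\ -0.17 & 0.50 \end{pmatrix}, \qquad \Psi_3 = \begin{pmatrix} 0.047 & -0.310 \\ -0.016 & -0.345 \end{pmatrix}. \]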
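
The recursion $\Psi_j = \Phi_1\Psi_{j-1} + \ldots + \Phi_p\Psi_{j-p}$ is easy to mechanize. The sketch below, in Python with NumPy, is one possible way to do so. The function name `psi_weights` is an illustrative choice, and the nilpotent matrix at the end is a hypothetical instance of the situation described in Remark 12.1.

```python
import numpy as np

def psi_weights(coeffs, horizon):
    """Return [Psi_0, ..., Psi_horizon] of a causal VAR(p) with coefficients coeffs."""
    n = coeffs[0].shape[0]
    psi = [np.eye(n)]                      # Psi_0 = I_n
    for j in range(1, horizon + 1):
        # Psi_j = Phi_1 Psi_{j-1} + ... + Phi_p Psi_{j-p}, with Psi_k = 0 for k < 0
        psi_j = sum(Phi_k @ psi[j - k]
                    for k, Phi_k in enumerate(coeffs, start=1) if j - k >= 0)
        psi.append(psi_j)
    return psi

# VAR(2) example of this section.
Phi1 = np.array([[0.8, -0.5], [0.1, -0.5]])
Phi2 = np.array([[-0.3, -0.3], [-0.2, 0.3]])
psi = psi_weights([Phi1, Phi2], horizon=3)
print(psi[2])
print(psi[3])

# Hypothetical nilpotent VAR(1) coefficient: all weights beyond Psi_1 vanish.
Phi_nilpotent = np.array([[0.0, 0.7], [0.0, 0.0]])
psi_nilpotent = psi_weights([Phi_nilpotent], horizon=3)
```

For the example, `psi[2]` and `psi[3]` reproduce the matrices $\Psi_2$ and $\Psi_3$ reported above.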


12.4 Computation of the Covariance Function of a Causal VAR Process

As in the univariate case, it is important to be able to compute the covariance
and the correlation function of a VARMA process (see Sect. 2.4). As explained in
Remark 12.1 we will concentrate on VAR processes. Consider first the case of a
causal VAR(1) process:

\[ X_t = \Phi X_{t-1} + Z_t, \qquad Z_t \sim \mathrm{WN}(0, \Sigma). \]

Multiplying the above equation first by $X_t'$ and then successively by $X_{t-h}'$ from the
right, $h = 1, 2, \ldots$, and taking expectations, we obtain the Yule-Walker equations:

\begin{align*}
E(X_t X_t') = \Gamma(0) &= \Phi\, E(X_{t-1}X_t') + E(Z_t X_t') = \Phi\Gamma(-1) + \Sigma, \\
E(X_t X_{t-h}') = \Gamma(h) &= \Phi\, E(X_{t-1}X_{t-h}') + E(Z_t X_{t-h}') = \Phi\Gamma(h-1).
\end{align*}

Knowing $\Gamma(0)$ and $\Phi$, $\Gamma(h)$, $h > 0$, can be computed recursively from the second
equation as

\[ \Gamma(h) = \Phi^h \Gamma(0), \qquad h = 1, 2, \ldots \tag{12.3} \]

Given $\Phi$ and $\Sigma$, we can compute $\Gamma(0)$. For $h = 1$, the second equation above
implies $\Gamma(1) = \Phi\Gamma(0)$. Inserting this expression in the first equation and using the
fact that $\Gamma(-1) = \Gamma(1)'$, we get an equation in $\Gamma(0)$:

\[ \Gamma(0) = \Phi\Gamma(0)\Phi' + \Sigma. \]

This equation can be solved for $\Gamma(0)$:

\begin{align*}
\operatorname{vec}\Gamma(0) &= \operatorname{vec}(\Phi\Gamma(0)\Phi') + \operatorname{vec}\Sigma \\
&= (\Phi \otimes \Phi)\operatorname{vec}\Gamma(0) + \operatorname{vec}\Sigma,
\end{align*}

where $\otimes$ and "vec" denote the Kronecker-product and the vec-operator, respectively.3
Thus,

\[ \operatorname{vec}\Gamma(0) = (I_{n^2} - \Phi \otimes \Phi)^{-1}\operatorname{vec}\Sigma. \tag{12.4} \]

The assumption that $\{X_t\}$ is causal with respect to $\{Z_t\}$ guarantees that the
eigenvalues of $\Phi \otimes \Phi$ are strictly smaller than one in absolute value, implying that
$I_{n^2} - \Phi \otimes \Phi$ is invertible.4
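
Equations (12.3) and (12.4) translate directly into a few lines of linear algebra. The sketch below uses Python with NumPy. The values of `Phi` and `Sigma` are hypothetical and chosen only so that the process is causal, and the function name `var1_gamma0` is an illustrative choice.

```python
import numpy as np

def var1_gamma0(Phi, Sigma):
    """Solve Gamma(0) = Phi Gamma(0) Phi' + Sigma via Eq. (12.4)."""
    n = Phi.shape[0]
    vec_sigma = Sigma.reshape(-1, order="F")   # vec stacks columns (Fortran order)
    vec_gamma0 = np.linalg.solve(np.eye(n * n) - np.kron(Phi, Phi), vec_sigma)
    return vec_gamma0.reshape(n, n, order="F")

# Hypothetical causal VAR(1); these numbers are for illustration only.
Phi = np.array([[0.5, 0.2], [-0.1, 0.3]])
Sigma = np.array([[1.0, 0.4], [0.4, 2.0]])

Gamma0 = var1_gamma0(Phi, Sigma)

# Gamma(h) = Phi Gamma(h-1) = Phi^h Gamma(0), Eq. (12.3).
Gamma = [Gamma0]
for h in range(1, 4):
    Gamma.append(Phi @ Gamma[-1])
```

Solving the linear system with `np.linalg.solve` avoids forming the inverse of $I_{n^2} - \Phi \otimes \Phi$ explicitly, which is a design choice of this sketch rather than something the text prescribes.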

If the process is a causal VAR(p) process the covariance function can be found in
two ways. The first one rewrites the process in companion form as a VAR(1) process
and applies the procedure just outlined. The second way relies on the Yule-Walker
equation. This equation is obtained by multiplying the stochastic difference equation
from the right by $X_t'$ and then successively by $X_{t-h}'$, $h > 0$, and taking expectations:

\begin{align*}
\Gamma(0) &= \Phi_1\Gamma(-1) + \ldots + \Phi_p\Gamma(-p) + \Sigma \\
          &= \Phi_1\Gamma(1)' + \ldots + \Phi_p\Gamma(p)' + \Sigma, \\
\Gamma(h) &= \Phi_1\Gamma(h-1) + \ldots + \Phi_p\Gamma(h-p). \tag{12.5}
\end{align*}

The second equation can be used to compute $\Gamma(h)$, $h \ge p$, recursively taking
$\Phi_1, \ldots, \Phi_p$ and the starting values $\Gamma(p-1), \ldots, \Gamma(0)$ as given. The starting values
can be retrieved by transforming the VAR(p) model into the companion form and
proceeding as explained above.


3 The vec-operator stacks the columns of an $n \times m$ matrix to a column vector of dimension $nm$. The
properties of $\otimes$ and vec can be found, e.g. in Magnus and Neudecker (1988).

4 If the eigenvalues of $\Phi$ are $\lambda_i$, $i = 1, \ldots, n$, then the eigenvalues of $\Phi \otimes \Phi$ are $\lambda_i\lambda_j$, $i, j = 1, \ldots, n$
(see Magnus and Neudecker (1988)).

Example


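We illustrate the computation of the covariance function using the same example as
in Sect. 12.3. First, we transform the model into the companion form:

\[ Y_t = \begin{pmatrix} X_{1,t} \\ X_{2,t} \\ X_{1,t-1} \\ X_{2,t-1} \end{pmatrix} = \begin{pmatrix} 0.8 & -0.5 & -0.3 & -0.3 \\ 0.1 & -0.5 & -0.2 & 0.3 \\ 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \end{pmatrix} \begin{pmatrix} X_{1,t-1} \\ X_{2,t-1} \\ X_{1,t-2} \\ X_{2,t-2} \end{pmatrix} + \begin{pmatrix} Z_{1,t} \\ Z_{2,t} \\ 0 \\ 0 \end{pmatrix}. \]

Equation (12.4) implies that $\Gamma_Y(0)$ is given by:

\[ \operatorname{vec}\Gamma_Y(0) = \operatorname{vec}\begin{pmatrix} \Gamma_X(0) & \Gamma_X(1) \\ \Gamma_X(1)' & \Gamma_X(0) \end{pmatrix} = (I_{16} - \Phi \otimes \Phi)^{-1} \operatorname{vec}\begin{pmatrix} \Sigma & 0 \\ 0 & 0 \end{pmatrix} \]

so that $\Gamma_X(0)$ and $\Gamma_X(1)$ become:

\[ \Gamma_X(0) = \begin{pmatrix} 2.4201 & 0.5759 \\ 0.5759 & 3.8978 \end{pmatrix}, \qquad \Gamma_X(1) = \begin{pmatrix} 1.3996 & -0.5711 \\ -0.4972 & -2.5599 \end{pmatrix}. \]

The other covariance matrices can then be computed recursively according to
Eq. (12.5):

\begin{align*}
\Gamma_X(2) &= \Phi_1\Gamma_X(1) + \Phi_2\Gamma_X(0) = \begin{pmatrix} 0.4695 & -0.5191 \\ 0.0773 & 2.2770 \end{pmatrix}, \\
\Gamma_X(3) &= \Phi_1\Gamma_X(2) + \Phi_2\Gamma_X(1) = \begin{pmatrix} 0.0662 & -0.6145 \\ -0.4208 & -1.8441 \end{pmatrix}.
\end{align*}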
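
The two steps of this example (companion form plus Eq. (12.4), then the recursion (12.5)) can be sketched in Python with NumPy as follows. One caveat: the numerical value of $\Sigma$ is not legible in this copy of the example, so the matrix `Sigma` below is an assumption. It is the value implied by the reported $\Gamma_X(0)$, $\Gamma_X(1)$ and $\Gamma_X(2)$ through $\Gamma(0) = \Phi_1\Gamma(1)' + \Phi_2\Gamma(2)' + \Sigma$.

```python
import numpy as np

Phi1 = np.array([[0.8, -0.5], [0.1, -0.5]])
Phi2 = np.array([[-0.3, -0.3], [-0.2, 0.3]])
# Assumption: Sigma is backed out from the reported covariances via Eq. (12.5);
# it is not stated legibly in the source.
Sigma = np.array([[1.0, 0.4], [0.4, 2.0]])

n = 2
# Companion matrix and covariance matrix of U_t = (Z_t', 0')'.
Phi = np.block([[Phi1, Phi2], [np.eye(n), np.zeros((n, n))]])
Sigma_U = np.zeros((2 * n, 2 * n))
Sigma_U[:n, :n] = Sigma

# Eq. (12.4) applied to the companion VAR(1); vec stacks columns.
vec_gamma_y = np.linalg.solve(np.eye((2 * n) ** 2) - np.kron(Phi, Phi),
                              Sigma_U.reshape(-1, order="F"))
Gamma_Y0 = vec_gamma_y.reshape(2 * n, 2 * n, order="F")

# Upper-left block is Gamma_X(0), upper-right block is Gamma_X(1).
Gamma = [Gamma_Y0[:n, :n], Gamma_Y0[:n, n:]]
for h in (2, 3):
    Gamma.append(Phi1 @ Gamma[h - 1] + Phi2 @ Gamma[h - 2])   # Eq. (12.5)

for h, G in enumerate(Gamma):
    print(f"Gamma_X({h}) =\n{np.round(G, 4)}")
```

Under the stated assumption on `Sigma`, the printed matrices should agree with the values reported above up to rounding.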


Appendix:  Autoregressive  Final  Form 

Definition 12.1 defined the VARMA process $\{X_t\}$ as a solution to the corresponding
multivariate stochastic difference equation (12.1). However, as pointed out by
Zellner and Palm (1974) there is an equivalent representation in the form of $n$
univariate ARMA processes, one for each $X_{it}$. Formally, these representations, also
called autoregressive final form or transfer function form (Box and Jenkins 1976),
can be written as

\[ \det\Phi(L)\, X_{it} = \left[\Phi^*(L)\Theta(L)\right]_{i\bullet} Z_t \]

where the index $i\bullet$ indicates the $i$-th row of $\Phi^*(L)\Theta(L)$. Thereby $\Phi^*(L)$ denotes the
adjugate matrix of $\Phi(L)$.5 Thus each variable in $X_t$ may be investigated separately as
a univariate ARMA process. Thereby the autoregressive part will be the same for
each variable. Note, however, that the moving-average processes will be correlated
across variables.

The disadvantage of this approach is that it involves rather long AR and MA lags
as will become clear from the following example.6 Take a simple two-dimensional
VAR of order one, i.e. $X_t = \Phi X_{t-1} + Z_t$, $Z_t \sim \mathrm{WN}(0, \Sigma)$. Then the implied univariate
processes will be ARMA(2,1) processes. After some straightforward manipulations
we obtain:

\begin{align*}
\left(1 - (\phi_{11} + \phi_{22})L + (\phi_{11}\phi_{22} - \phi_{12}\phi_{21})L^2\right) X_{1t} &= Z_{1t} - \phi_{22}Z_{1,t-1} + \phi_{12}Z_{2,t-1}, \\
\left(1 - (\phi_{11} + \phi_{22})L + (\phi_{11}\phi_{22} - \phi_{12}\phi_{21})L^2\right) X_{2t} &= \phi_{21}Z_{1,t-1} + Z_{2t} - \phi_{11}Z_{2,t-1}.
\end{align*}

It can be shown by the means given in Sects. 1.4.3 and 1.5.1 that the right hand sides
are observationally equivalent to MA(1) processes.
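
The two final-form equations can be verified on a simulated path. The sketch below, in Python with NumPy, uses a hypothetical coefficient matrix `Phi` and Gaussian shocks. Both are assumptions made for illustration only, since the identity holds path by path for any shocks.

```python
import numpy as np

rng = np.random.default_rng(0)
Phi = np.array([[0.5, 0.2], [-0.1, 0.3]])   # hypothetical VAR(1) coefficients
T = 200
Z = rng.standard_normal((T, 2))

# Simulate X_t = Phi X_{t-1} + Z_t, started at X_0 = 0.
X = np.zeros((T, 2))
for t in range(1, T):
    X[t] = Phi @ X[t - 1] + Z[t]

trace, det = np.trace(Phi), np.linalg.det(Phi)
idx = np.arange(2, T)

# Left-hand side: common AR(2) polynomial det Phi(L) applied to each component.
lhs = X[idx] - trace * X[idx - 1] + det * X[idx - 2]

# Right-hand side: rows of the adjugate Phi*(L) applied to Z_t.
rhs = np.column_stack([
    Z[idx, 0] - Phi[1, 1] * Z[idx - 1, 0] + Phi[0, 1] * Z[idx - 1, 1],
    Phi[1, 0] * Z[idx - 1, 0] + Z[idx, 1] - Phi[0, 0] * Z[idx - 1, 1],
])
print(np.allclose(lhs, rhs))
```

Both components share the AR polynomial $1 - \operatorname{tr}(\Phi)L + \det(\Phi)L^2$, which is the statement that the autoregressive part is the same for each variable.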


5 The elements of the adjugate matrix $A^*$ of some matrix $A$ are given by $[A^*]_{ij} = (-1)^{i+j}M_{ij}$ where
$M_{ij}$ is the minor (minor determinant) obtained by deleting the $i$-th column and the $j$-th row of $A$
(Meyer 2000, p. 477).

6 The degrees of the AR and the MA polynomial can be as large as $np$ and $(n-1)p + q$, respectively.

13.1 Introduction

In this chapter we derive the Least-Squares (LS) estimator for vector autoregressive
(VAR) models and its asymptotic distribution. To this end, we have to make several
assumptions which we maintain throughout this chapter.

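Assumption 13.1. The VAR process $\{X_t\}$ is generated by

\[ \Phi(L)X_t = X_t - \Phi_1 X_{t-1} - \ldots - \Phi_p X_{t-p} = Z_t \qquad \text{with } Z_t \sim \mathrm{WN}(0, \Sigma), \]

$\Sigma$ nonsingular, and admits a stationary and causal representation with respect
to $\{Z_t\}$:

\[ X_t = Z_t + \Psi_1 Z_{t-1} + \Psi_2 Z_{t-2} + \ldots = \sum_{j=0}^{\infty} \Psi_j Z_{t-j} = \Psi(L)Z_t \]

with $\sum_{j=0}^{\infty} \|\Psi_j\| < \infty$.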

Assumption 13.2. The residual process $\{Z_t\}$ is not only white noise, but also
independently and identically distributed:

\[ Z_t \sim \mathrm{IID}(0, \Sigma). \]

Assumption 13.3. All fourth moments of $Z_t$ exist. In particular, there exists a finite
constant $c > 0$ such that

\[ E\left(Z_{it}Z_{jt}Z_{kt}Z_{lt}\right) \le c \qquad \text{for all } i, j, k, l = 1, 2, \ldots, n, \text{ and for all } t. \]

Note that the moment condition is automatically fulfilled by Gaussian processes.
For the ease of exposition, we omit a constant in the VAR. Thus, we consider the
demeaned process.


13.2  The  Least-Squares  Estimator 

Let us denote by $\phi_{ij}^{(k)}$ the $(i,j)$-th element of the matrix $\Phi_k$, $k = 1, 2, \ldots, p$, then the
$i$-th equation, $i = 1, \ldots, n$, can be written as

\[ X_{it} = \phi_{i1}^{(1)}X_{1,t-1} + \ldots + \phi_{in}^{(1)}X_{n,t-1} + \ldots + \phi_{i1}^{(p)}X_{1,t-p} + \ldots + \phi_{in}^{(p)}X_{n,t-p} + Z_{it}. \]

We can view this equation as a regression equation of $X_{it}$ on all lagged variables
$X_{1,t-1}, \ldots, X_{n,t-1}, \ldots, X_{1,t-p}, \ldots, X_{n,t-p}$ with error term $Z_{it}$. Note that the regressors
are the same for each equation. The $np$ regressors have coefficient vector
$\left(\phi_{i1}^{(1)}, \ldots, \phi_{in}^{(1)}, \ldots, \phi_{i1}^{(p)}, \ldots, \phi_{in}^{(p)}\right)'$. Thus, the complete VAR(p) model has $n^2p$
coefficients in total to be estimated. In addition, there are $n(n+1)/2$ independent
elements of the covariance matrix $\Sigma$ that have to be estimated too.

It is clear that the $n$ different equations are linked through the regressors and the
error terms which in general have non-zero covariances $\sigma_{ij} = E Z_{it}Z_{jt}$. Hence, it
seems warranted to take a systems approach and to estimate all equations of the
VAR jointly. Below, we will see that an equation-by-equation approach is, however,
still appropriate.

Suppose that we have $T + p$ observations with $t = -p+1, \ldots, 0, 1, \ldots, T$, then
we can write the regressor matrix for each equation compactly as a $T \times np$ matrix $X$:

\[ X = \begin{pmatrix} X_{1,0} & \ldots & X_{n,0} & \ldots & X_{1,-p+1} & \ldots & X_{n,-p+1} \\ X_{1,1} & \ldots & X_{n,1} & \ldots & X_{1,-p+2} & \ldots & X_{n,-p+2} \\ \vdots & & \vdots & & \vdots & & \vdots \\ X_{1,T-1} & \ldots & X_{n,T-1} & \ldots & X_{1,T-p} & \ldots & X_{n,T-p} \end{pmatrix}. \]

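Using this notation, we can write the VAR for observations $t = 1, 2, \ldots, T$ as

\[ \underbrace{(X_1, X_2, \ldots, X_T)}_{=Y} = \underbrace{(\Phi_1, \Phi_2, \ldots, \Phi_p)}_{=\Phi} \underbrace{\begin{pmatrix} X_0 & X_1 & \ldots & X_{T-1} \\ X_{-1} & X_0 & \ldots & X_{T-2} \\ \vdots & \vdots & & \vdots \\ X_{-p+1} & X_{-p+2} & \ldots & X_{T-p} \end{pmatrix}}_{=X'} + \underbrace{(Z_1, Z_2, \ldots, Z_T)}_{=Z} \]

or more compactly

\[ Y = \Phi X' + Z. \]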

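There are two ways to bring this equation system into the usual multivariate regression
framework. One can either arrange the data according to observations or according
to equations. Ordering in terms of observations yields:

\[ \operatorname{vec} Y = \operatorname{vec}(\Phi X') + \operatorname{vec} Z = (X \otimes I_n)\operatorname{vec}\Phi + \operatorname{vec} Z \tag{13.1} \]

with $\operatorname{vec} Y = (X_{11}, X_{21}, \ldots, X_{n1}, X_{12}, X_{22}, \ldots, X_{n2}, \ldots, X_{1T}, X_{2T}, \ldots, X_{nT})'$. If the
data are arranged equation by equation, the dependent variable is $\operatorname{vec} Y' =
(X_{11}, X_{12}, \ldots, X_{1T}, X_{21}, X_{22}, \ldots, X_{2T}, \ldots, X_{n1}, X_{n2}, \ldots, X_{nT})'$. As both representations,
obviously, contain the same information, there exists an $nT \times nT$ permutation
or commutation matrix $K_{nT}$ such that $\operatorname{vec} Y' = K_{nT}\operatorname{vec} Y$. Using the computation
rules for the Kronecker product, the vec operator, and the permutation matrix
(see Magnus and Neudecker 1988), we get for the ordering in terms of equations

\begin{align*}
\operatorname{vec} Y' = K_{nT}\operatorname{vec} Y &= K_{nT}\left(\operatorname{vec}(\Phi X') + \operatorname{vec} Z\right) \\
&= K_{nT}(X \otimes I_n)\operatorname{vec}\Phi + K_{nT}\operatorname{vec} Z \\
&= (I_n \otimes X)K_{n^2p}\operatorname{vec}\Phi + K_{nT}\operatorname{vec} Z \\
&= (I_n \otimes X)\operatorname{vec}\Phi' + \operatorname{vec} Z' \tag{13.2}
\end{align*}

where $K_{n^2p}$ is the corresponding $n^2p \times n^2p$ permutation matrix relating $\operatorname{vec}\Phi$ and $\operatorname{vec}\Phi'$.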
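
The vec and Kronecker rules behind (13.1) and (13.2) can be checked on small random matrices. The following sketch uses Python with NumPy. The helpers `vec` and `commutation_matrix` as well as the matrix sizes are illustrative choices and not taken from the text.

```python
import numpy as np

def vec(A):
    """Stack the columns of A into one vector."""
    return A.reshape(-1, order="F")

def commutation_matrix(m, n):
    """Permutation matrix K with K @ vec(A) = vec(A.T) for any m x n matrix A."""
    K = np.zeros((m * n, m * n))
    for i in range(m):
        for j in range(n):
            # A[i, j] is entry j*m + i of vec(A) and entry i*n + j of vec(A.T)
            K[i * n + j, j * m + i] = 1.0
    return K

rng = np.random.default_rng(1)
n, p, T = 2, 2, 6                          # illustrative sizes
Phi = rng.standard_normal((n, n * p))      # (Phi_1, ..., Phi_p)
X = rng.standard_normal((T, n * p))        # regressor matrix
Y = Phi @ X.T                              # systematic part of Y = Phi X' + Z

by_observation = np.kron(X, np.eye(n)) @ vec(Phi)     # Eq. (13.1)
by_equation = np.kron(np.eye(n), X) @ vec(Phi.T)      # Eq. (13.2)
K_nT = commutation_matrix(n, T)

print(np.allclose(by_observation, vec(Y)),
      np.allclose(by_equation, vec(Y.T)),
      np.allclose(K_nT @ vec(Y), vec(Y.T)))
```

The error term is left out here because the identities concern only the systematic part $\Phi X'$.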

The error terms of the different equations are correlated because, in general, the
covariances $\sigma_{ij} = E Z_{it}Z_{jt}$ are nonzero. In the case of an arrangement by observation
the covariance matrix of the error term $\operatorname{vec} Z$ is

\[ V \operatorname{vec} Z = E(\operatorname{vec} Z)(\operatorname{vec} Z)' = \begin{pmatrix} \Sigma & 0 & \cdots & 0 \\ 0 & \Sigma & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & \Sigma \end{pmatrix} = I_T \otimes \Sigma, \qquad \Sigma = \begin{pmatrix} \sigma_1^2 & \cdots & \sigma_{1n} \\ \vdots & \ddots & \vdots \\ \sigma_{n1} & \cdots & \sigma_n^2 \end{pmatrix}. \]

In the second case, the arrangement by equation, the covariance matrix of the error
term $\operatorname{vec} Z'$ is

\[ V \operatorname{vec} Z' = E(\operatorname{vec} Z')(\operatorname{vec} Z')' = \begin{pmatrix} \sigma_1^2 I_T & \sigma_{12} I_T & \cdots & \sigma_{1n} I_T \\ \sigma_{21} I_T & \sigma_2^2 I_T & \cdots & \sigma_{2n} I_T \\ \vdots & \vdots & \ddots & \vdots \\ \sigma_{n1} I_T & \sigma_{n2} I_T & \cdots & \sigma_n^2 I_T \end{pmatrix} = \Sigma \otimes I_T. \]

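Given that the covariance matrix is not a multiple of the identity matrix, efficient
estimation requires the use of generalized least squares (GLS). The GLS estimator
minimizes the weighted sum of squared errors

\[ S(\operatorname{vec}\Phi) = (\operatorname{vec} Z)'(I_T \otimes \Sigma)^{-1}(\operatorname{vec} Z) \longrightarrow \min_{\Phi}. \]

The solution of this minimization problem can be found in standard econometric
textbooks like (Dhrymes 1978; Greene 2008; Hamilton 1994b) and is given by

\begin{align*}
(\operatorname{vec}\hat\Phi)_{GLS} &= \left((X \otimes I_n)'(I_T \otimes \Sigma)^{-1}(X \otimes I_n)\right)^{-1}(X \otimes I_n)'(I_T \otimes \Sigma)^{-1}\operatorname{vec} Y \\
&= \left((X' \otimes I_n)(I_T \otimes \Sigma^{-1})(X \otimes I_n)\right)^{-1}(X' \otimes I_n)(I_T \otimes \Sigma^{-1})\operatorname{vec} Y \\
&= \left((X' \otimes \Sigma^{-1})(X \otimes I_n)\right)^{-1}(X' \otimes \Sigma^{-1})\operatorname{vec} Y \\
&= \left((X'X) \otimes \Sigma^{-1}\right)^{-1}(X' \otimes \Sigma^{-1})\operatorname{vec} Y \\
&= \left((X'X)^{-1} \otimes \Sigma\right)(X' \otimes \Sigma^{-1})\operatorname{vec} Y = \left(\left((X'X)^{-1}X'\right) \otimes I_n\right)\operatorname{vec} Y \\
&= (\operatorname{vec}\hat\Phi)_{OLS}.
\end{align*}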


As the covariance matrix $\Sigma$ cancels, the GLS and the OLS estimator deliver
numerically exactly the same solution. The reason for this result is that the
regressors are the same in each equation. If this does not hold, for example when
some coefficients are set a priori to zero, efficient estimation would require the use
of GLS.

Further insights can be gained by rewriting the estimation problem in terms of
the arrangement by equation (see Eq. (13.2)). For this purpose, multiply the above
estimator from the left by the commutation matrix $K_{n^2p}$:

\begin{align*}
(\operatorname{vec}\hat\Phi')_{OLS} = K_{n^2p}(\operatorname{vec}\hat\Phi)_{OLS} &= K_{n^2p}\left(\left((X'X)^{-1}X'\right) \otimes I_n\right)\operatorname{vec} Y \\
&= \left(I_n \otimes \left((X'X)^{-1}X'\right)\right)K_{nT}\operatorname{vec} Y = \left(I_n \otimes \left((X'X)^{-1}X'\right)\right)\operatorname{vec} Y'.
\end{align*}

This can be written in a more explicit form as

\[ (\operatorname{vec}\hat\Phi')_{OLS} = \begin{pmatrix} (X'X)^{-1}X' & 0 & \cdots & 0 \\ 0 & (X'X)^{-1}X' & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & (X'X)^{-1}X' \end{pmatrix} \operatorname{vec} Y' = \begin{pmatrix} (X'X)^{-1}X'Y_1 \\ (X'X)^{-1}X'Y_2 \\ \vdots \\ (X'X)^{-1}X'Y_n \end{pmatrix} \]

where $Y_i$, $i = 1, \ldots, n$, stacks the observations of the $i$-th variable such that $Y_i =
(X_{i1}, X_{i2}, \ldots, X_{iT})'$. Thus, the estimation of the VAR as a system can be broken down
into the estimation of $n$ regression equations with dependent variable $X_{it}$. Each of
these equations can then be estimated by OLS.

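Thus, we have proven that

\begin{align*}
\operatorname{vec}\hat\Phi &= (\operatorname{vec}\hat\Phi)_{GLS} = (\operatorname{vec}\hat\Phi)_{OLS} = \left(\left((X'X)^{-1}X'\right) \otimes I_n\right)\operatorname{vec} Y, \tag{13.3} \\
\operatorname{vec}\hat\Phi' &= (\operatorname{vec}\hat\Phi')_{GLS} = (\operatorname{vec}\hat\Phi')_{OLS} = \left(I_n \otimes \left((X'X)^{-1}X'\right)\right)\operatorname{vec} Y'. \tag{13.4}
\end{align*}

The least squares estimator can also be rewritten without the use of the vec-operator:

\[ \hat\Phi = YX(X'X)^{-1}. \]

Under the assumptions stated in the Introduction Sect. 13.1, these estimators are
consistent and asymptotically normal.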
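
The numerical equality of the system OLS, the equation-by-equation OLS and the GLS estimator can be illustrated on simulated data. The sketch below is written in Python with NumPy. The helpers `simulate_var` and `design`, the Gaussian shocks, the sample size and the value of `Sigma` are assumptions made for the illustration. The coefficient matrices are those of the example in Sect. 12.3.

```python
import numpy as np

def simulate_var(coeffs, Sigma, T, rng, burn=200):
    """Simulate T + p observations of a zero-mean Gaussian VAR(p)."""
    n, p = coeffs[0].shape[0], len(coeffs)
    chol = np.linalg.cholesky(Sigma)
    data = np.zeros((T + p + burn, n))
    for t in range(p, T + p + burn):
        shock = chol @ rng.standard_normal(n)
        data[t] = sum(Phi_k @ data[t - k] for k, Phi_k in enumerate(coeffs, 1)) + shock
    return data[burn:]

def design(data, p):
    """Return Y (n x T) and the regressor matrix X (T x np) of Sect. 13.2."""
    T = data.shape[0] - p
    Y = data[p:].T
    X = np.hstack([data[p - k:p - k + T] for k in range(1, p + 1)])
    return Y, X

rng = np.random.default_rng(42)
Phi1 = np.array([[0.8, -0.5], [0.1, -0.5]])
Phi2 = np.array([[-0.3, -0.3], [-0.2, 0.3]])
Sigma = np.array([[1.0, 0.4], [0.4, 2.0]])      # assumed error covariance

data = simulate_var([Phi1, Phi2], Sigma, T=200, rng=rng)
Y, X = design(data, p=2)
n, T = Y.shape

# System OLS without the vec-operator: Phi_hat = Y X (X'X)^{-1}.
Phi_hat = Y @ X @ np.linalg.inv(X.T @ X)

# Equation-by-equation OLS: regress each row of Y on the common regressors X.
Phi_rows = np.vstack([np.linalg.lstsq(X, Y[i], rcond=None)[0] for i in range(n)])

# GLS in the arrangement by observation with weight (I_T kron Sigma)^{-1}.
W = np.kron(np.eye(T), np.linalg.inv(Sigma))
D = np.kron(X, np.eye(n))
vec_gls = np.linalg.solve(D.T @ W @ D, D.T @ W @ Y.reshape(-1, order="F"))
Phi_gls = vec_gls.reshape(n, -1, order="F")

print(np.allclose(Phi_hat, Phi_rows), np.allclose(Phi_hat, Phi_gls))
```

The three estimates coincide whatever positive definite matrix is used as weight, which is the cancellation of $\Sigma$ derived above.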

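Theorem 13.1 (Asymptotic Distribution of OLS Estimator). Under the assumptions
stated in the Introduction Sect. 13.1, it holds that

\[ \operatorname{plim} \hat\Phi = \Phi \]

and that

\[ \text{by observation:} \quad \sqrt{T}\left(\operatorname{vec}\hat\Phi - \operatorname{vec}\Phi\right) \xrightarrow{d} N\left(0, \Gamma_p^{-1} \otimes \Sigma\right), \]

respectively,

\[ \text{by equation:} \quad \sqrt{T}\left(\operatorname{vec}\hat\Phi' - \operatorname{vec}\Phi'\right) \xrightarrow{d} N\left(0, \Sigma \otimes \Gamma_p^{-1}\right) \]

where $\Gamma_p = \operatorname{plim} \frac{1}{T}(X'X)$.

Proof. See Sect. 13.3. □

1 Alternatively, one could start from scratch and investigate the minimization problem $S(\operatorname{vec}\Phi') =
(\operatorname{vec} Z')'(\Sigma^{-1} \otimes I_T)(\operatorname{vec} Z') \to \min_{\Phi}$.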


In order to make use of this result in practice, we have to replace the matrices $\Sigma$
and $\Gamma_p$ by some estimate. A natural consistent estimate of $\Gamma_p$ is given according to
Proposition 13.1 by

\[ \hat\Gamma_p = \frac{X'X}{T}. \]

In analogy to the multivariate regression model, a natural estimator for $\Sigma$ can be
obtained from the least-squares residuals $\hat Z$:

\[ \hat\Sigma = \frac{1}{T}\sum_{t=1}^{T}\hat Z_t\hat Z_t' = \frac{\hat Z\hat Z'}{T} = \frac{(Y - \hat\Phi X')(Y - \hat\Phi X')'}{T}. \]

The property of this estimator is summarized in the theorem below.

Theorem 13.2. Under the condition of Theorem 13.1

\[ \sqrt{T}\left(\hat\Sigma - \frac{ZZ'}{T}\right) \xrightarrow{p} 0. \]

Proof. See Sect. 13.3. □

An alternative, but asymptotically equivalent estimator $\tilde\Sigma$ is obtained by adjusting
$\hat\Sigma$ for the degrees of freedom:

\[ \tilde\Sigma = \frac{T}{T - np}\,\hat\Sigma. \tag{13.5} \]

If the VAR contains a constant, as is normally the case in practice, the degrees of
freedom correction should be $T - np - 1$.


Small sample inference with respect to the parameters $\Phi$ can therefore be carried
out using the approximate distribution

\[ \operatorname{vec}\hat\Phi' \sim N\left(\operatorname{vec}\Phi', \hat\Sigma \otimes (X'X)^{-1}\right). \tag{13.6} \]

This implies that hypothesis testing can be carried out using the conventional t- and
F-statistics. From a system perspective, the appropriate degrees of freedom for the
t-ratio would be $nT - n^2p - n$, taking a constant in each equation into account.
However, as the system can be estimated on an equation-by-equation basis,
it seems reasonable to use $T - np - 1$ instead. This corresponds to a multivariate
regression setting with $T$ observations and $np + 1$ regressors, including a constant.
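
A minimal sketch of how the residual covariance estimators and the approximate standard errors might be computed is given below, in Python with NumPy. The simulated VAR(1) with coefficient matrix `Phi_true`, standard normal shocks and no constant is an assumption made purely for illustration, and the variable names are not taken from the text.

```python
import numpy as np

rng = np.random.default_rng(7)
n, p, T = 2, 1, 400
Phi_true = np.array([[0.5, 0.2], [-0.1, 0.3]])   # hypothetical VAR(1)

# Simulate T + p observations started at zero.
data = np.zeros((T + p, n))
for t in range(1, T + p):
    data[t] = Phi_true @ data[t - 1] + rng.standard_normal(n)

Y, X = data[p:].T, data[:-1]          # Y is n x T, X is T x np (here p = 1)
XtX_inv = np.linalg.inv(X.T @ X)
Phi_hat = Y @ X @ XtX_inv             # least-squares estimator

Z_hat = Y - Phi_hat @ X.T                     # least-squares residuals
Sigma_hat = Z_hat @ Z_hat.T / T               # unadjusted estimator of Sigma
Sigma_tilde = T / (T - n * p) * Sigma_hat     # Eq. (13.5), VAR without constant

# Arrangement by equation: V(vec Phi_hat') is approximated by Sigma kron (X'X)^{-1}.
cov_by_equation = np.kron(Sigma_tilde, XtX_inv)
# Row i of std_err holds the standard errors of the coefficients in equation i.
std_err = np.sqrt(np.diag(cov_by_equation)).reshape(n, n * p)
t_ratios = Phi_hat / std_err
print(np.round(Phi_hat, 3))
print(np.round(std_err, 3))
```

Using $\tilde\Sigma$ rather than $\hat\Sigma$ in the covariance matrix is a choice of this sketch. Asymptotically the two lead to the same inference.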

However, as in the univariate case the Gauss-Markov theorem does not apply
because the lagged regressors are correlated with past error terms. This results in
biased estimates in small samples. The amount of the bias can be assessed and
corrected either by analytical or bootstrap methods. For an overview, a comparison
of the different corrections proposed in the literature, and further references see
Engsted and Pedersen (2014).


13.3 Proofs of the Asymptotic Properties of the Least-Squares Estimator

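Lemma 13.1. Given the assumptions made in Sect. 13.1, the process $\{\operatorname{vec} Z_{t-j}Z_{t-i}'\}$,
$i, j \in \mathbb{Z}$ and $i \neq j$, is white noise.

Proof. Using the independence assumption of $\{Z_t\}$, we immediately get

\begin{align*}
E \operatorname{vec} Z_{t-j}Z_{t-i}' &= E(Z_{t-i} \otimes Z_{t-j}) = 0, \\
V(\operatorname{vec} Z_{t-j}Z_{t-i}') &= E\left((Z_{t-i} \otimes Z_{t-j})(Z_{t-i} \otimes Z_{t-j})'\right) \\
&= E\left((Z_{t-i}Z_{t-i}') \otimes (Z_{t-j}Z_{t-j}')\right) = \Sigma \otimes \Sigma, \\
\Gamma_{\operatorname{vec} Z_{t-j}Z_{t-i}'}(h) &= E\left((Z_{t-i} \otimes Z_{t-j})(Z_{t-i-h} \otimes Z_{t-j-h})'\right) \\
&= E\left((Z_{t-i}Z_{t-i-h}') \otimes (Z_{t-j}Z_{t-j-h}')\right) = 0, \qquad h \neq 0.
\end{align*}

□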

Under the assumptions put forward in the Introduction, $\frac{1}{T}(X'X)$ converges in
probability for $T \to \infty$ to an $np \times np$ matrix $\Gamma_p$. This matrix consists of $p^2$ blocks
where each $(i,j)$-th block corresponds to the covariance matrix $\Gamma(i-j)$. Thus we
have the following proposition:

Proposition 13.1. Under the assumptions stated in the Introduction Sect. 13.1

\[ \frac{X'X}{T} \xrightarrow{p} \Gamma_p = \begin{pmatrix} \Gamma(0) & \Gamma(1) & \ldots & \Gamma(p-1) \\ \Gamma'(1) & \Gamma(0) & \ldots & \Gamma(p-2) \\ \vdots & \vdots & \ddots & \vdots \\ \Gamma'(p-1) & \Gamma'(p-2) & \ldots & \Gamma(0) \end{pmatrix} \]

with $\Gamma_p$ being nonsingular.


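Proof. Write $\frac{1}{T}(X'X)$ as

\[ \frac{X'X}{T} = \begin{pmatrix} \hat\Gamma(0) & \hat\Gamma(1) & \ldots & \hat\Gamma(p-1) \\ \hat\Gamma'(1) & \hat\Gamma(0) & \ldots & \hat\Gamma(p-2) \\ \vdots & \vdots & \ddots & \vdots \\ \hat\Gamma'(p-1) & \hat\Gamma'(p-2) & \ldots & \hat\Gamma(0) \end{pmatrix} \]

where

\[ \hat\Gamma(h) = \frac{1}{T}\sum_{t=0}^{T-1} X_t X_{t-h}', \qquad h = 0, 1, \ldots, p-1. \]

We will show that each component $\hat\Gamma(h)$ of $\frac{1}{T}X'X$ converges in probability to $\Gamma(h)$.
Taking the causal representation of $\{X_t\}$ into account

\begin{align*}
\hat\Gamma(h) = \frac{1}{T}\sum_{t=0}^{T-1} X_t X_{t-h}' &= \frac{1}{T}\sum_{t=0}^{T-1}\sum_{j=0}^{\infty}\sum_{i=0}^{\infty} \Psi_j Z_{t-j}Z_{t-h-i}'\Psi_i' \\
&= \sum_{j=0}^{\infty}\sum_{i=0}^{\infty} \Psi_j\left(\frac{1}{T}\sum_{t=0}^{T-1} Z_{t-j}Z_{t-h-i}'\right)\Psi_i' \\
&= \sum_{j=0}^{\infty}\sum_{i=h}^{\infty} \Psi_j\left(\frac{1}{T}\sum_{t=0}^{T-1} Z_{t-j}Z_{t-i}'\right)\Psi_{i-h}'.
\end{align*}

According to Lemma 13.1 above $\{Z_{t-j}Z_{t-i}'\}$, $i \neq j$, is white noise. Thus,

\[ \frac{1}{T}\sum_{t=0}^{T-1} Z_{t-j}Z_{t-i}' \xrightarrow{p} 0, \qquad i \neq j, \]

according to Theorem 11.1. Hence, for $m$ fixed,

\[ G_m(h) = \sum_{j=0}^{m}\sum_{\substack{i=h \\ i \neq j}}^{m+h} \Psi_j\left(\frac{1}{T}\sum_{t=0}^{T-1} Z_{t-j}Z_{t-i}'\right)\Psi_{i-h}' \xrightarrow{p} 0. \]

Taking absolute values and expectations element-wise,

\begin{align*}
E\left|G_\infty(h) - G_m(h)\right| &= E\left|\sum_{\substack{j>m \text{ or } i>m+h \\ i \neq j}} \Psi_j\left(\frac{1}{T}\sum_{t=0}^{T-1} Z_{t-j}Z_{t-i}'\right)\Psi_{i-h}'\right| \\
&\le \sum_{\substack{j>m \text{ or } i>m+h \\ i \neq j}} |\Psi_j|\left(E|Z_1Z_2'|\right)|\Psi_{i-h}'| \\
&\le \sum_{j>m \text{ or } i>m} |\Psi_j|\left(E|Z_1Z_2'|\right)|\Psi_i'| \\
&\le \sum_{j>m} |\Psi_j|\left(E|Z_1Z_2'|\sum_{i}|\Psi_i'|\right) + \left(\sum_{j}|\Psi_j|\,E|Z_1Z_2'|\right)\sum_{i>m}|\Psi_i'|.
\end{align*}

As the bound is independent of $T$ and converges to 0 as $m \to \infty$, we have

\[ \lim_{m \to \infty}\limsup_{T \to \infty} E\left|G_\infty(h) - G_m(h)\right| = 0. \]

The Basic Approximation Theorem C.14 then establishes that

\[ G_\infty(h) \xrightarrow{p} 0. \]

Henceforth

\begin{align*}
\hat\Gamma(h) &= G_\infty(h) + \sum_{j=h}^{\infty} \Psi_j\left(\frac{1}{T}\sum_{t=0}^{T-1} Z_{t-j}Z_{t-j}'\right)\Psi_{j-h}' \\
&= G_\infty(h) + \sum_{j=h}^{\infty} \Psi_j\left(\frac{1}{T}\sum_{t=0}^{T-1} Z_tZ_t'\right)\Psi_{j-h}' + \text{remainder}
\end{align*}

where the remainder only depends on initial conditions2 and is therefore negligible
as $T \to \infty$. As

\[ \frac{1}{T}\sum_{t=0}^{T-1} Z_tZ_t' \xrightarrow{p} \Sigma, \]

we finally get

\[ \hat\Gamma(h) \xrightarrow{p} \sum_{j=h}^{\infty} \Psi_j\Sigma\Psi_{j-h}' = \Gamma(h). \]

The last equality follows from Theorem 10.2. □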


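Proposition 13.2. Under the assumptions stated in the Introduction Sect. 13.1

\begin{align*}
\frac{1}{\sqrt{T}}\sum_{t=1}^{T}\operatorname{vec}\left(Z_tX_{t-1}', Z_tX_{t-2}', \ldots, Z_tX_{t-p}'\right) &= \frac{1}{\sqrt{T}}\operatorname{vec}(ZX) = \frac{1}{\sqrt{T}}(X' \otimes I_n)\operatorname{vec} Z \\
&\xrightarrow{d} N(0, \Gamma_p \otimes \Sigma).
\end{align*}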


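Proof. The idea of the proof is to approximate $\{X_t\}$ by some simpler process $\{X_t^{(m)}\}$
which allows the application of the CLT for dependent processes (Theorem C.13).
This leads to an asymptotic distribution which by virtue of the Basic Approximation
Theorem C.14 converges to the asymptotic distribution of the original process.
Define $X_t^{(m)}$ as the truncated process from the causal representation of $X_t$:

\[ X_t^{(m)} = Z_t + \Psi_1Z_{t-1} + \ldots + \Psi_mZ_{t-m}, \qquad m = p, p+1, p+2, \ldots \]

Using this approximation, we can then define the process $\{Y_t^{(m)}\}$ as

\[ Y_t^{(m)} = \operatorname{vec}\left(Z_tX_{t-1}^{(m)\prime}, Z_tX_{t-2}^{(m)\prime}, \ldots, Z_tX_{t-p}^{(m)\prime}\right) = \begin{pmatrix} X_{t-1}^{(m)} \\ X_{t-2}^{(m)} \\ \vdots \\ X_{t-p}^{(m)} \end{pmatrix} \otimes Z_t. \]

2 See the proof of Theorem 11.2.2 in Brockwell and Davis (1991) for details.

Due to the independence of $\{Z_t\}$ this process is a mean zero white noise process,
but is clearly not independent. It is easy to see that the process is actually $(m+p)$-dependent
with variance $V_m$ given by

\begin{align*}
V_m = E\,Y_t^{(m)}Y_t^{(m)\prime} &= E\left[\left(\begin{pmatrix} X_{t-1}^{(m)} \\ \vdots \\ X_{t-p}^{(m)} \end{pmatrix} \otimes Z_t\right)\left(\begin{pmatrix} X_{t-1}^{(m)} \\ \vdots \\ X_{t-p}^{(m)} \end{pmatrix} \otimes Z_t\right)'\right] \\
&= E\left[\begin{pmatrix} X_{t-1}^{(m)} \\ \vdots \\ X_{t-p}^{(m)} \end{pmatrix}\begin{pmatrix} X_{t-1}^{(m)} \\ \vdots \\ X_{t-p}^{(m)} \end{pmatrix}'\right] \otimes E\,Z_tZ_t' \\
&= \begin{pmatrix} \Gamma^{(m)}(0) & \Gamma^{(m)}(1) & \ldots & \Gamma^{(m)}(p-1) \\ \Gamma^{(m)}(1)' & \Gamma^{(m)}(0) & \ldots & \Gamma^{(m)}(p-2) \\ \vdots & \vdots & \ddots & \vdots \\ \Gamma^{(m)}(p-1)' & \Gamma^{(m)}(p-2)' & \ldots & \Gamma^{(m)}(0) \end{pmatrix} \otimes \Sigma \\
&= \Gamma_p^{(m)} \otimes \Sigma
\end{align*}

where $\Gamma_p^{(m)}$ is composed of

\begin{align*}
\Gamma^{(m)}(h) &= E\,X_{t-1}^{(m)}X_{t-1-h}^{(m)\prime} \\
&= E\left(Z_{t-1} + \Psi_1Z_{t-2} + \ldots + \Psi_mZ_{t-1-m}\right)\left(Z_{t-1-h} + \Psi_1Z_{t-2-h} + \ldots + \Psi_mZ_{t-1-m-h}\right)' \\
&= \sum_{j=h}^{m} \Psi_j\Sigma\Psi_{j-h}', \qquad h = 0, 1, \ldots, p-1.
\end{align*}

Thus, we can invoke the CLT for $(m+p)$-dependent processes (see Theorem C.13) to
establish that

\[ \sqrt{T}\left(\frac{1}{T}\sum_{t=1}^{T} Y_t^{(m)}\right) \xrightarrow{d} N(0, V_m). \]

For $m \to \infty$, $\Gamma^{(m)}(h)$ converges to $\Gamma(h)$ and thus $\Gamma_p^{(m)}$ to $\Gamma_p$. Therefore, $V_m \to \Gamma_p \otimes \Sigma$.

The variance of the approximation error is equal to

\begin{align*}
& V\left(\frac{1}{\sqrt{T}}\sum_{t=1}^{T}\left(\operatorname{vec}\left(Z_tX_{t-1}', \ldots, Z_tX_{t-p}'\right) - Y_t^{(m)}\right)\right) \\
&\quad = \frac{1}{T}V\left(\sum_{t=1}^{T}\operatorname{vec}\left(Z_t\left(X_{t-1} - X_{t-1}^{(m)}\right)', \ldots, Z_t\left(X_{t-p} - X_{t-p}^{(m)}\right)'\right)\right) \\
&\quad = \frac{1}{T}V\left(\sum_{t=1}^{T}\operatorname{vec}\left(Z_t\Big(\sum_{j=m+1}^{\infty}\Psi_jZ_{t-1-j}\Big)', \ldots, Z_t\Big(\sum_{j=m+1}^{\infty}\Psi_jZ_{t-p-j}\Big)'\right)\right) \\
&\quad = V\left(\operatorname{vec}\left(Z_t\Big(\sum_{j=m+1}^{\infty}\Psi_jZ_{t-1-j}\Big)', \ldots, Z_t\Big(\sum_{j=m+1}^{\infty}\Psi_jZ_{t-p-j}\Big)'\right)\right) \\
&\quad = E\left[\left(\begin{pmatrix} \sum_{j=m+1}^{\infty}\Psi_jZ_{t-1-j} \\ \vdots \\ \sum_{j=m+1}^{\infty}\Psi_jZ_{t-p-j} \end{pmatrix} \otimes Z_t\right)\left(\begin{pmatrix} \sum_{j=m+1}^{\infty}\Psi_jZ_{t-1-j} \\ \vdots \\ \sum_{j=m+1}^{\infty}\Psi_jZ_{t-p-j} \end{pmatrix} \otimes Z_t\right)'\right] \\
&\quad = \begin{pmatrix} \sum_{j=m+1}^{\infty}\Psi_j\Sigma\Psi_j' & \cdots & \cdots \\ \vdots & \ddots & \vdots \\ \cdots & \cdots & \sum_{j=m+1}^{\infty}\Psi_j\Sigma\Psi_j' \end{pmatrix} \otimes \Sigma.
\end{align*}

The absolute summability of $\Psi_j$ then implies that the infinite sums converge to
zero as $m \to \infty$. As $X_t^{(m)} \xrightarrow{m.s.} X_t$ for $m \to \infty$, we can apply the Basic
Approximation Theorem C.14 to reach the required conclusion

\[ \frac{1}{\sqrt{T}}\sum_{t=1}^{T}\operatorname{vec}\left(Z_tX_{t-1}', Z_tX_{t-2}', \ldots, Z_tX_{t-p}'\right) \xrightarrow{d} N(0, \Gamma_p \otimes \Sigma). \]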

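Proof of Theorem 13.1

Proof. We prove the theorem for the arrangement by observation. The proof for the
arrangement by equation is completely analogous. Inserting the
regression formula (13.1) into the least-squares formula (13.3) leads to:

\begin{align*}
\operatorname{vec}\hat\Phi &= \left(\left((X'X)^{-1}X'\right) \otimes I_n\right)(X \otimes I_n)\operatorname{vec}\Phi + \left(\left((X'X)^{-1}X'\right) \otimes I_n\right)\operatorname{vec} Z \\
&= \operatorname{vec}\Phi + \left(\left((X'X)^{-1}X'\right) \otimes I_n\right)\operatorname{vec} Z. \tag{13.7}
\end{align*}

Bringing $\operatorname{vec}\Phi$ to the left hand side and taking the probability limit, we get using
Slutzky's Lemma C.10 for the product of probability limits

\begin{align*}
\operatorname{plim}\left(\operatorname{vec}\hat\Phi - \operatorname{vec}\Phi\right) &= \operatorname{plim}\operatorname{vec}\left(ZX(X'X)^{-1}\right) \\
&= \operatorname{vec}\left(\operatorname{plim}\frac{ZX}{T}\;\operatorname{plim}\left(\frac{X'X}{T}\right)^{-1}\right) = 0.
\end{align*}

The last equality follows from the observation that Proposition 13.1 implies
$\operatorname{plim}\frac{X'X}{T} = \Gamma_p$ nonsingular and that Proposition 13.2 implies $\operatorname{plim}\frac{ZX}{T} = 0$. Thus,
we have established that the Least-Squares estimator is consistent.

Equation (13.7) further implies:

\begin{align*}
\sqrt{T}\left(\operatorname{vec}\hat\Phi - \operatorname{vec}\Phi\right) &= \sqrt{T}\left(\left((X'X)^{-1}X'\right) \otimes I_n\right)\operatorname{vec} Z \\
&= \left(\left(\frac{X'X}{T}\right)^{-1} \otimes I_n\right)\frac{1}{\sqrt{T}}(X' \otimes I_n)\operatorname{vec} Z.
\end{align*}

As $\operatorname{plim}\frac{X'X}{T} = \Gamma_p$ nonsingular, the above expression converges in distribution
according to Theorem C.10 and Proposition 13.2 to a normally distributed random
variable with mean zero and covariance matrix

\[ (\Gamma_p^{-1} \otimes I_n)(\Gamma_p \otimes \Sigma)(\Gamma_p^{-1} \otimes I_n) = \Gamma_p^{-1} \otimes \Sigma. \]

□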


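Proof of Theorem 13.2

Proof.

\begin{align*}
\hat\Sigma &= \frac{(Y - \hat\Phi X')(Y - \hat\Phi X')'}{T} \\
&= \frac{\left(Y - \Phi X' + (\Phi - \hat\Phi)X'\right)\left(Y - \Phi X' + (\Phi - \hat\Phi)X'\right)'}{T} \\
&= \frac{1}{T}\left(Z + (\Phi - \hat\Phi)X'\right)\left(Z + (\Phi - \hat\Phi)X'\right)' \\
&= \frac{ZZ'}{T} + \frac{ZX}{T}(\Phi - \hat\Phi)' + (\Phi - \hat\Phi)\frac{X'Z'}{T} + (\Phi - \hat\Phi)\frac{X'X}{T}(\Phi - \hat\Phi)'.
\end{align*}

Applying Theorem C.7 and the results of Propositions 13.1 and 13.2 shows that

\[ \frac{ZX}{\sqrt{T}}(\Phi - \hat\Phi)' \xrightarrow{p} 0 \]

and

\[ (\Phi - \hat\Phi)\frac{X'X}{T}\sqrt{T}(\Phi - \hat\Phi)' \xrightarrow{p} 0. \]

Hence,

\[ \sqrt{T}\left(\frac{(Y - \hat\Phi X')(Y - \hat\Phi X')'}{T} - \frac{ZZ'}{T}\right) = \sqrt{T}\left(\hat\Sigma - \frac{ZZ'}{T}\right) \xrightarrow{p} 0. \]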


13.4  The  Yule-Walker  Estimator 

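An alternative estimation method can be derived from the Yule-Walker equations.
Consider first a VAR(1) model. The Yule-Walker equation in this case simply is:

\begin{align*}
\Gamma(0) &= \Phi\Gamma(-1) + \Sigma \\
\Gamma(1) &= \Phi\Gamma(0)
\end{align*}

or

\begin{align*}
\Gamma(0) &= \Phi\Gamma(0)\Phi' + \Sigma \\
\Gamma(1) &= \Phi\Gamma(0).
\end{align*}

The solution of this system of equations is:

\begin{align*}
\Phi &= \Gamma(1)\Gamma(0)^{-1} \\
\Sigma &= \Gamma(0) - \Phi\Gamma(0)\Phi' = \Gamma(0) - \Gamma(1)\Gamma(0)^{-1}\Gamma(0)\Gamma(0)^{-1}\Gamma(1)' \\
&= \Gamma(0) - \Gamma(1)\Gamma(0)^{-1}\Gamma(1)'.
\end{align*}

Replacing the theoretical moments by their empirical counterparts, we get the
Yule-Walker estimator for $\Phi$ and $\Sigma$:

\begin{align*}
\hat\Phi &= \hat\Gamma(1)\hat\Gamma(0)^{-1}, \\
\hat\Sigma &= \hat\Gamma(0) - \hat\Phi\hat\Gamma(0)\hat\Phi'.
\end{align*}

In the general case of a VAR(p) model the Yule-Walker estimator is given as the
solution of the equation system

\begin{align*}
\hat\Gamma(h) &= \sum_{j=1}^{p}\hat\Phi_j\hat\Gamma(h-j), \qquad h = 1, \ldots, p, \\
\hat\Sigma &= \hat\Gamma(0) - \hat\Phi_1\hat\Gamma(-1) - \ldots - \hat\Phi_p\hat\Gamma(-p).
\end{align*}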
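
For the VAR(1) case the Yule-Walker estimator amounts to two sample autocovariances and one matrix inversion. The sketch below uses Python with NumPy. The simulated data, the coefficient matrix `Phi_true` and the helper `sample_autocov` are assumptions made for illustration. The divisor $T$ in the sample autocovariance is the convention adopted here.

```python
import numpy as np

def sample_autocov(data, h):
    """Gamma_hat(h) = (1/T) sum_t (X_t - mean)(X_{t-h} - mean)' for h >= 0."""
    T = data.shape[0]
    centered = data - data.mean(axis=0)
    return centered[h:].T @ centered[:T - h] / T

rng = np.random.default_rng(3)
Phi_true = np.array([[0.5, 0.2], [-0.1, 0.3]])   # hypothetical VAR(1)
T = 500
data = np.zeros((T, 2))
for t in range(1, T):
    data[t] = Phi_true @ data[t - 1] + rng.standard_normal(2)

G0, G1 = sample_autocov(data, 0), sample_autocov(data, 1)

Phi_yw = G1 @ np.linalg.inv(G0)              # Phi_hat = Gamma_hat(1) Gamma_hat(0)^{-1}
Sigma_yw = G0 - Phi_yw @ G0 @ Phi_yw.T       # Sigma_hat = Gamma_hat(0) - Phi_hat Gamma_hat(0) Phi_hat'

# Modulus of the largest eigenvalue of the estimated coefficient matrix.
spectral_radius = np.abs(np.linalg.eigvals(Phi_yw)).max()
print(np.round(Phi_yw, 3), spectral_radius)
```

With this divisor the matrix built from $\hat\Gamma(0)$ and $\hat\Gamma(1)$ is positive semidefinite, so the estimated $\hat\Sigma$ is positive semidefinite as well.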


As the least-squares and the Yule-Walker estimator differ only in the treatment
of the starting values, they are asymptotically equivalent. In fact, they yield very
similar estimates even for finite samples (see e.g. Reinsel (1993)). However, as in the
univariate case, the Yule-Walker estimator always delivers, in contrast to the
least-squares estimator, coefficient estimates with the property
$\det(I_n - \hat\Phi_1z - \ldots - \hat\Phi_pz^p) \neq 0$ for all $z \in \mathbb{C}$ with $|z| \le 1$. Thus, the Yule-Walker estimator
guarantees that the estimated VAR possesses a causal representation. This, however, comes at the price
that the Yule-Walker estimator has a larger small-sample bias than the least-squares
estimator, especially when the roots of $\Phi(z)$ get close to the unit circle (Tjøstheim
and Paulsen 1983; Shaman and Stine 1988; Reinsel 1993). Thus, it is generally
preferable to use the least-squares estimator in practice.

14.1 Forecasting with Known Parameters

The  discussion  of  forecasting  with  VAR  models  proceeds  in  two  steps.  First,  we 
assume  that  the  parameters  of  the  model  are  known.  Although  this  assumption  is 
unrealistic,  it  will  nevertheless  allow  us  to  introduce  and  analyze  important  concepts 
and  ideas.  In  a  second  step,  we  then  investigate  how  the  results  established  in  the 
first  step  have  to  be  amended  if  the  parameters  are  estimated.  The  analysis  will 
focus on stationary and causal VAR(1) processes. Processes of higher order can be accommodated by rewriting them in companion form. Thus we have:

$$X_t = \Phi X_{t-1} + Z_t, \qquad Z_t \sim \mathrm{WN}(0, \Sigma),$$

$$X_t = Z_t + \Psi_1 Z_{t-1} + \Psi_2 Z_{t-2} + \ldots = \sum_{j=0}^{\infty} \Psi_j Z_{t-j},$$

where $\Psi_j = \Phi^j$. Consider then the following forecasting problem: Given observations $\{X_T, X_{T-1}, \ldots, X_1\}$, find a linear function, called predictor or forecast function, $P_T X_{T+h}$, $h \ge 1$, which minimizes the expected quadratic forecast error

$$\mathbb{E}\,(X_{T+h} - P_T X_{T+h})'(X_{T+h} - P_T X_{T+h}) = \mathbb{E}\,\mathrm{tr}\,(X_{T+h} - P_T X_{T+h})(X_{T+h} - P_T X_{T+h})'.$$

Thereby “tr” denotes the trace operator which takes the sum of the diagonal elements of a matrix. As we rely on linear forecasting functions, $P_T X_{T+h}$ can be expressed as

$$P_T X_{T+h} = A_1 X_T + A_2 X_{T-1} + \ldots + A_T X_1 \qquad (14.1)$$

with matrices $A_1, A_2, \ldots, A_T$ still to be determined. In order to simplify the exposition, we already accounted for the fact that the mean of $\{X_t\}$ is zero.¹ A justification for focusing on linear least-squares forecasts is given in Chap. 3. The first order conditions for the least-squares minimization problem are given by the normal equations:

$$\mathbb{E}(X_{T+h} - P_T X_{T+h}) X_s' = \mathbb{E}(X_{T+h} - A_1 X_T - \ldots - A_T X_1) X_s' = \mathbb{E} X_{T+h} X_s' - A_1 \mathbb{E} X_T X_s' - \ldots - A_T \mathbb{E} X_1 X_s' = 0, \qquad 1 \le s \le T.$$

These equations state that the forecast error $(X_{T+h} - P_T X_{T+h})$ must be uncorrelated with the available information $X_s$, $s = 1, 2, \ldots, T$. The normal equations can be written as


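$$(A_1, A_2, \ldots, A_T) \begin{pmatrix} \Gamma(0) & \Gamma(1) & \ldots & \Gamma(T-1) \\ \Gamma'(1) & \Gamma(0) & \ldots & \Gamma(T-2) \\ \vdots & \vdots & \ddots & \vdots \\ \Gamma'(T-1) & \Gamma'(T-2) & \ldots & \Gamma(0) \end{pmatrix} = \begin{pmatrix} \Gamma(h) & \Gamma(h+1) & \ldots & \Gamma(T+h-1) \end{pmatrix}.$$

Denoting by $\Gamma_T$ the matrix

$$\Gamma_T = \begin{pmatrix} \Gamma(0) & \Gamma(1) & \ldots & \Gamma(T-1) \\ \Gamma'(1) & \Gamma(0) & \ldots & \Gamma(T-2) \\ \vdots & \vdots & \ddots & \vdots \\ \Gamma'(T-1) & \Gamma'(T-2) & \ldots & \Gamma(0) \end{pmatrix},$$

the normal equations can be written more compactly as

$$(A_1, A_2, \ldots, A_T)\,\Gamma_T = \begin{pmatrix} \Gamma(h) & \Gamma(h+1) & \ldots & \Gamma(T+h-1) \end{pmatrix}.$$

Using the assumption that $\{X_t\}$ is a VAR(1), $\Gamma(h)$ can be expressed as $\Gamma(h) = \Phi^h\Gamma(0)$ (see Eq. (12.3)) so that the normal equations become

$$(A_1, A_2, \ldots, A_T) \begin{pmatrix} \Gamma(0) & \Phi\Gamma(0) & \ldots & \Phi^{T-1}\Gamma(0) \\ \Gamma(0)\Phi' & \Gamma(0) & \ldots & \Phi^{T-2}\Gamma(0) \\ \vdots & \vdots & \ddots & \vdots \\ \Gamma(0)\Phi'^{\,T-1} & \Gamma(0)\Phi'^{\,T-2} & \ldots & \Gamma(0) \end{pmatrix} = \begin{pmatrix} \Phi^h\Gamma(0) & \Phi^{h+1}\Gamma(0) & \ldots & \Phi^{T+h-1}\Gamma(0) \end{pmatrix}.$$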


1 If the mean is non-zero, a constant $A_0$ must be added to the forecast function.




The easily guessed solution is given by $A_1 = \Phi^h$ and $A_2 = \ldots = A_T = 0$. Thus, the sought-after forecasting function for the VAR(1) process is

$$P_T X_{T+h} = \Phi^h X_T. \qquad (14.2)$$

The forecast error $X_{T+h} - P_T X_{T+h}$ has expectation zero. Thus, the linear least-squares predictor delivers unbiased forecasts. As

$$X_{T+h} = Z_{T+h} + \Phi Z_{T+h-1} + \ldots + \Phi^{h-1} Z_{T+1} + \Phi^h X_T,$$

the expected squared forecast error (mean squared error) MSE(h) is

$$\mathrm{MSE}(h) = \mathbb{E}(X_{T+h} - \Phi^h X_T)(X_{T+h} - \Phi^h X_T)' = \Sigma + \Phi\Sigma\Phi' + \ldots + \Phi^{h-1}\Sigma\Phi'^{\,h-1} = \sum_{j=0}^{h-1} \Phi^j\Sigma\Phi'^{\,j}. \qquad (14.3)$$
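
The forecast function (14.2) and the mean squared error (14.3) translate almost literally into code. The following is a minimal sketch, assuming NumPy; the function names and the parameter values `phi`, `sigma` and `x_T` are hypothetical and serve only as an illustration.

```python
import numpy as np

def var1_forecast(phi, x_T, h):
    """h-step forecast P_T X_{T+h} = phi^h x_T, Eq. (14.2)."""
    return np.linalg.matrix_power(phi, h) @ x_T

def var1_mse(phi, sigma, h):
    """MSE(h) = sum_{j=0}^{h-1} phi^j sigma (phi^j)', Eq. (14.3)."""
    phi = np.asarray(phi, dtype=float)
    sigma = np.asarray(sigma, dtype=float)
    mse = np.zeros_like(sigma)
    power = np.eye(phi.shape[0])            # phi^0
    for _ in range(h):
        mse += power @ sigma @ power.T
        power = phi @ power                 # next power of phi
    return mse

# Illustrative (assumed) parameter values
phi = np.array([[0.5, 0.2], [-0.1, 0.4]])
sigma = np.array([[1.0, 0.3], [0.3, 2.0]])
x_T = np.array([1.0, -1.0])

forecast_2 = var1_forecast(phi, x_T, 2)
mse_2 = var1_mse(phi, sigma, 2)
```

Because every additional term in the sum is positive semi-definite, the diagonal of `var1_mse` cannot decrease as the horizon grows.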

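In order to analyze the case of a causal VAR(p) process with $T > p$, we transform the model into the companion form. For $h = 1$, we can apply the result above to get:

$$P_T Y_{T+1} = \boldsymbol{\Phi} Y_T = \begin{pmatrix} P_T X_{T+1} \\ X_T \\ X_{T-1} \\ \vdots \\ X_{T-p+2} \end{pmatrix} = \begin{pmatrix} \Phi_1 & \Phi_2 & \ldots & \Phi_{p-1} & \Phi_p \\ I_n & 0 & \ldots & 0 & 0 \\ 0 & I_n & \ldots & 0 & 0 \\ \vdots & \vdots & \ddots & \vdots & \vdots \\ 0 & 0 & \ldots & I_n & 0 \end{pmatrix} \begin{pmatrix} X_T \\ X_{T-1} \\ X_{T-2} \\ \vdots \\ X_{T-p+1} \end{pmatrix}.$$
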

This implies that

$$P_T X_{T+1} = \Phi_1 X_T + \Phi_2 X_{T-1} + \ldots + \Phi_p X_{T-p+1}. \qquad (14.4)$$

The forecast error is $X_{T+1} - P_T X_{T+1} = Z_{T+1}$ which has mean zero and covariance matrix $\Sigma$. In general we have that $P_T Y_{T+h} = \boldsymbol{\Phi}^h Y_T$, where $\boldsymbol{\Phi}$ denotes the companion matrix, so that $P_T X_{T+h}$ is equal to

$$P_T X_{T+h} = \Phi_1^{(h)} X_T + \Phi_2^{(h)} X_{T-1} + \ldots + \Phi_p^{(h)} X_{T-p+1}$$

where $\Phi_i^{(h)}$, $i = 1, \ldots, p$, denote the blocks in the first row of $\boldsymbol{\Phi}^h$. Alternatively, the forecast for $h > 1$ can be computed recursively. For $h = 2$ this leads to:

$$\begin{aligned}
P_T X_{T+2} &= P_T(\Phi_1 X_{T+1}) + P_T(\Phi_2 X_T) + \ldots + P_T(\Phi_p X_{T+2-p}) + P_T(Z_{T+2}) \\
&= \Phi_1(\Phi_1 X_T + \Phi_2 X_{T-1} + \ldots + \Phi_p X_{T+1-p}) + \Phi_2 X_T + \ldots + \Phi_p X_{T+2-p} \\
&= (\Phi_1^2 + \Phi_2) X_T + (\Phi_1\Phi_2 + \Phi_3) X_{T-1} + \ldots + (\Phi_1\Phi_{p-1} + \Phi_p) X_{T+2-p} + \Phi_1\Phi_p X_{T+1-p}.
\end{aligned}$$

For $h > 2$ we proceed analogously. This way of producing forecasts is sometimes called iterated forecasts.

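In general, the forecast error of a causal VAR(p) process can be expressed as

$$X_{T+h} - P_T X_{T+h} = Z_{T+h} + \Psi_1 Z_{T+h-1} + \ldots + \Psi_{h-1} Z_{T+1} = \sum_{j=0}^{h-1} \Psi_j Z_{T+h-j}.$$

The MSE(h) then is:

$$\mathrm{MSE}(h) = \Sigma + \Psi_1\Sigma\Psi_1' + \ldots + \Psi_{h-1}\Sigma\Psi_{h-1}' = \sum_{j=0}^{h-1} \Psi_j\Sigma\Psi_j'. \qquad (14.5)$$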


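Example

Consider again the VAR(2) model of Sect. 12.3. The forecast function in this case is then:

$$\begin{aligned}
P_T X_{T+1} &= \Phi_1 X_T + \Phi_2 X_{T-1}, \\
P_T X_{T+2} &= (\Phi_1^2 + \Phi_2) X_T + \Phi_1\Phi_2 X_{T-1} \\
&= \begin{pmatrix} 0.29 & -0.45 \\ -0.17 & 0.50 \end{pmatrix} X_T + \begin{pmatrix} -0.14 & -0.39 \\ 0.07 & -0.18 \end{pmatrix} X_{T-1}, \\
P_T X_{T+3} &= (\Phi_1^3 + \Phi_1\Phi_2 + \Phi_2\Phi_1) X_T + (\Phi_1^2\Phi_2 + \Phi_2^2) X_{T-1} \\
&= \begin{pmatrix} 0.047 & -0.310 \\ -0.016 & -0.345 \end{pmatrix} X_T + \begin{pmatrix} 0.003 & -0.222 \\ -0.049 & 0.201 \end{pmatrix} X_{T-1}.
\end{aligned}$$

Based on the results computed in Sect. 12.3, we can calculate the corresponding mean squared errors (MSE):

$$\begin{aligned}
\mathrm{MSE}(1) &= \Sigma, \\
\mathrm{MSE}(2) &= \Sigma + \Psi_1\Sigma\Psi_1', \\
\mathrm{MSE}(3) &= \Sigma + \Psi_1\Sigma\Psi_1' + \Psi_2\Sigma\Psi_2' = \begin{pmatrix} 2.2047 & 0.3893 \\ 0.3893 & 2.9309 \end{pmatrix}.
\end{aligned}$$
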


A  practical  forecasting  exercise  with  additional  material  is  presented  in  Sect.  14.4. 
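
The numbers of the VAR(2) example can be checked with a few lines of NumPy using the companion form. The sketch below rests on an explicit assumption: the matrices `phi1`, `phi2` and `sigma` are not printed in this section, so the values used here are assumed values for the model of Sect. 12.3 and should be checked against that section. The helper names `forecast_weights` and `mse` are hypothetical.

```python
import numpy as np

# Assumed parameter values for the VAR(2) example (not printed in this section)
phi1 = np.array([[0.8, -0.5], [0.1, -0.5]])
phi2 = np.array([[-0.3, -0.3], [-0.2, 0.3]])
sigma = np.array([[1.0, 0.4], [0.4, 2.0]])
n, p = 2, 2

# Companion matrix: first block row (phi1, phi2), identity below
companion = np.zeros((n * p, n * p))
companion[:n, :n] = phi1
companion[:n, n:] = phi2
companion[n:, :n] = np.eye(n)

def forecast_weights(h):
    """Blocks in the first row of companion^h: weights on X_T and X_{T-1}."""
    top = np.linalg.matrix_power(companion, h)[:n, :]
    return top[:, :n], top[:, n:]

def mse(h):
    """MSE(h) = sum_{j<h} Psi_j sigma Psi_j', with Psi_j the top-left block of companion^j."""
    total = np.zeros((n, n))
    for j in range(h):
        psi_j = np.linalg.matrix_power(companion, j)[:n, :n]
        total += psi_j @ sigma @ psi_j.T
    return total

w2_current, w2_lagged = forecast_weights(2)
w3_current, w3_lagged = forecast_weights(3)
mse_3 = mse(3)
```

With these assumed inputs, `forecast_weights` and `mse` can be compared directly with the coefficient matrices of the two- and three-period forecasts and with MSE(3) given in the example.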
14.1.1 Wold Decomposition Theorem

At this stage we note that Wold’s theorem or Wold’s Decomposition carries over to the multivariate case (see Sect. 3.2 for the univariate case). This theorem asserts that there exists for each purely non-deterministic stationary process² a decomposition, respectively representation, of the form:

$$X_t = \mu + \sum_{j=0}^{\infty} \Psi_j Z_{t-j},$$

where $\Psi_0 = I_n$, $Z_t \sim \mathrm{WN}(0, \Sigma)$ with $\Sigma > 0$ and $\sum_{j=0}^{\infty} \|\Psi_j\|^2 < \infty$. The innovations $\{Z_t\}$ have the property $Z_t = X_t - \widetilde{P}_{t-1} X_t$ and consequently $Z_t = \widetilde{P}_t Z_t$. Thereby $\widetilde{P}_t$ denotes the linear least-squares predictor based on the infinite past $\{X_t, X_{t-1}, \ldots\}$. The interpretation of the multivariate case is analogous to the univariate one.


14.2  Forecasting  with  Estimated  Parameters 

In practice the parameters of the VAR model are usually unknown and therefore have to be estimated. In the previous section we have demonstrated that

$$P_T X_{T+h} = \Phi_1 P_T X_{T+h-1} + \ldots + \Phi_p P_T X_{T+h-p}$$

where $P_T X_{T+h-j} = X_{T+h-j}$ if $j \ge h$. Replacing the true parameters by their estimates, we get the forecast function

$$\widehat{P}_T X_{T+h} = \widehat{\Phi}_1 \widehat{P}_T X_{T+h-1} + \ldots + \widehat{\Phi}_p \widehat{P}_T X_{T+h-p},$$

where a hat indicates the use of estimates. The forecast error can then be decomposed into two components:

$$X_{T+h} - \widehat{P}_T X_{T+h} = (X_{T+h} - P_T X_{T+h}) + (P_T X_{T+h} - \widehat{P}_T X_{T+h}) = \sum_{j=0}^{h-1} \Psi_j Z_{T+h-j} + (P_T X_{T+h} - \widehat{P}_T X_{T+h}). \qquad (14.6)$$

Dufour (1985) has shown that, under the assumption of symmetrically distributed $Z_t$'s (i.e. if $Z_t$ and $-Z_t$ have the same distribution), the expectation of the forecast error is zero even when the parameters are replaced by their least-squares estimates.


2 A stationary stochastic process is called deterministic if it can be perfectly forecasted from its infinite past. It is called purely non-deterministic if there is no deterministic component (see Sect. 3.2).
This result holds despite the fact that these estimates are biased in small samples. Moreover, the results do not assume that the model is correctly specified in terms of the order p. Thus, under quite general conditions the forecast with estimated coefficients remains unbiased so that $\mathbb{E}(X_{T+h} - \widehat{P}_T X_{T+h}) = 0$.

If the estimation is based on a different sample than the one used for forecasting, the two terms in the above expression are uncorrelated so that its mean squared error is given by the sum of the two mean squared errors:

$$\widehat{\mathrm{MSE}}(h) = \sum_{j=0}^{h-1} \Psi_j\Sigma\Psi_j' + \mathbb{E}(P_T X_{T+h} - \widehat{P}_T X_{T+h})(P_T X_{T+h} - \widehat{P}_T X_{T+h})'. \qquad (14.7)$$

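The last term can be evaluated by using the asymptotic distribution of the coefficients as an approximation. The corresponding formula turns out to be cumbersome. The technical details can be found in Lütkepohl (2006) and Reinsel (1993). The formula can, however, be simplified considerably if we consider a forecast horizon of only one period. We deduce the formula for a VAR of order one, i.e. taking $X_t = \Phi X_{t-1} + Z_t$, $Z_t \sim \mathrm{WN}(0, \Sigma)$:

$$P_T X_{T+1} - \widehat{P}_T X_{T+1} = (\Phi - \widehat{\Phi}) X_T = \mathrm{vec}\big((\Phi - \widehat{\Phi}) X_T\big) = (X_T' \otimes I_n)\,\mathrm{vec}(\Phi - \widehat{\Phi}).$$

This implies that

$$\begin{aligned}
\mathbb{E}(P_T X_{T+1} - \widehat{P}_T X_{T+1})(P_T X_{T+1} - \widehat{P}_T X_{T+1})'
&= \mathbb{E}\,(X_T' \otimes I_n)\,\mathrm{vec}(\Phi - \widehat{\Phi})\,(\mathrm{vec}(\Phi - \widehat{\Phi}))'\,(X_T \otimes I_n) \\
&= \mathbb{E}\,(X_T' \otimes I_n)\,\frac{\Gamma_1^{-1} \otimes \Sigma}{T}\,(X_T \otimes I_n) = \frac{1}{T}\,\mathbb{E}(X_T'\Gamma_1^{-1}X_T) \otimes \Sigma \\
&= \frac{1}{T}\,\mathbb{E}(\mathrm{tr}\,X_T'\Gamma_1^{-1}X_T) \otimes \Sigma = \frac{1}{T}\,\mathrm{tr}\big(\Gamma_1^{-1}\,\mathbb{E}(X_T X_T')\big) \otimes \Sigma \\
&= \frac{1}{T}\,(\mathrm{tr}(I_n) \otimes \Sigma) = \frac{n}{T}\,\Sigma.
\end{aligned}$$

Thereby, we have used the asymptotic normality of the least-squares estimator (see Theorem 13.1) and the assumption that forecasting and estimation use different realizations of the stochastic process. Thus, for $h = 1$ and $p = 1$, we get

$$\widehat{\mathrm{MSE}}(1) = \Sigma + \frac{n}{T}\,\Sigma = \frac{T+n}{T}\,\Sigma.$$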
Higher order models can be treated similarly using the companion form of VAR(p). In this case:

$$\widehat{\mathrm{MSE}}(1) = \Sigma + \frac{np}{T}\,\Sigma = \frac{T+np}{T}\,\Sigma. \qquad (14.8)$$

This is only an approximation as we applied asymptotic results to small sample entities. The expression shows that the effect of the substitution of the coefficients by their least-squares estimates vanishes as the sample becomes large. However, in small samples the factor $\frac{T+np}{T}$ can be sizeable. In the example treated in Sect. 14.4, the covariance matrix $\Sigma$, taking the use of a constant into account and assuming 8 lags, has to be inflated by $\frac{T+np+1}{T} = \frac{196 + 4 \times 8 + 1}{196} = 1.168$. Note also that the precision of the forecast, given $\Sigma$, diminishes with the number of parameters.
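
The size of this inflation factor is easy to tabulate. The helper below is a small sketch; the function name `mse_inflation_factor` and the `constant` flag are assumptions introduced here, where the flag adds one regressor per equation as in the example with T = 196, n = 4 and p = 8.

```python
def mse_inflation_factor(T, n, p, constant=False):
    """Approximate factor (T + np) / T from Eq. (14.8).

    With constant=True one additional regressor per equation is counted.
    """
    k = n * p + (1 if constant else 0)
    return (T + k) / T

# Values quoted in the text for the example of Sect. 14.4
factor = mse_inflation_factor(T=196, n=4, p=8, constant=True)
```

Calling the function for smaller `p` shows how quickly the approximate penalty for estimated coefficients falls when fewer lags are used.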


14.3 Modeling of VAR Models

The previous section treated the estimation of VAR models under the assumption that the order of the VAR, p, is known. In most cases, this assumption is unrealistic as the order p is unknown and must be retrieved from the data. We can proceed analogously as in the univariate case (see Sect. 5.1) and iteratively test the hypothesis that the coefficients corresponding to the highest lag are simultaneously equal to zero, i.e. $\Phi_p = 0$. Starting from a maximal order $p_{\max}$, we test the null hypothesis that $\Phi_{p_{\max}} = 0$ in the corresponding VAR($p_{\max}$) model. If the hypothesis is not rejected, we reduce the order by one to $p_{\max} - 1$ and test anew the null hypothesis $\Phi_{p_{\max}-1} = 0$ using the smaller VAR($p_{\max} - 1$) model. One continues in this way until the null hypothesis is rejected. This gives, then, the appropriate order of the VAR. The different tests can be carried out either as Wald-tests (F-tests) or as likelihood-ratio tests ($\chi^2$-tests) with $n^2$ degrees of freedom.

An alternative procedure to determine the order of the VAR relies on some information criteria. As in the univariate case, the most popular ones are the Akaike (AIC), the Schwarz or Bayesian (BIC) and the Hannan-Quinn criterion (HQC). The corresponding formulas are:

$$\begin{aligned}
\mathrm{AIC}(p) &: \quad \ln\det\widetilde{\Sigma}_p + \frac{2n^2p}{T}, \\
\mathrm{BIC}(p) &: \quad \ln\det\widetilde{\Sigma}_p + \frac{n^2p}{T}\,\ln T, \\
\mathrm{HQC}(p) &: \quad \ln\det\widetilde{\Sigma}_p + \frac{2n^2p}{T}\,\ln(\ln T),
\end{aligned}$$

where $\widetilde{\Sigma}_p$ denotes the degree of freedom adjusted estimate of the covariance matrix $\Sigma$ for a model of order p (see Eq. (13.5)). $n^2p$ is the number of estimated coefficients. The estimated order is then given as the minimizer of one of these criteria. In practice Akaike’s criterion is the most popular one although it has a tendency to deliver orders which are too high. The BIC and the HQ-criterion on the other hand deliver the correct order on average, but can lead to models which suffer from omitted variable bias when the estimated order is too low. Examples are discussed in Sects. 14.4 and 15.4.5.
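
To make the role of the penalty terms concrete, the sketch below evaluates the three criteria for a given covariance estimate. It assumes the standard forms ln det Σ̃_p + 2n²p/T (AIC), ln det Σ̃_p + (n²p/T) ln T (BIC) and ln det Σ̃_p + (2n²p/T) ln(ln T) (HQC). The function name `information_criteria` is hypothetical, and the covariance estimate `sigma_p` must be supplied by the user (the adjusted estimator of Eq. (13.5) is not reproduced here).

```python
import numpy as np

def information_criteria(sigma_p, n, p, T):
    """AIC, BIC and HQC for a VAR(p) with n variables and sample size T.

    sigma_p: user-supplied estimate of the innovation covariance matrix.
    """
    sigma_p = np.asarray(sigma_p, dtype=float)
    sign, logdet = np.linalg.slogdet(sigma_p)   # numerically stable log-determinant
    if sign <= 0:
        raise ValueError("covariance estimate must be positive definite")
    k = n * n * p                               # number of estimated coefficients
    return {
        "AIC": logdet + 2 * k / T,
        "BIC": logdet + k * np.log(T) / T,
        "HQC": logdet + 2 * k * np.log(np.log(T)) / T,
    }

# Toy input: identity covariance, so only the penalty terms remain
crit = information_criteria(np.eye(2), n=2, p=1, T=100)
```

For T = 100 the penalties are ordered AIC < HQC < BIC, which is consistent with the AIC's tendency to select larger orders.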

Following Lütkepohl (2006), Akaike’s information criterion can be rationalized as follows. Take as a measure of fit the determinant of the one period approximate mean-squared errors $\widehat{\mathrm{MSE}}(1)$ from Eq. (14.8) and take as an estimate of $\Sigma$ the degrees of freedom corrected version in Eq. (13.5). The resulting criterion is called according to Akaike (1969) the final prediction error (FPE):

$$\mathrm{FPE}(p) = \det\left(\frac{T+np}{T} \times \frac{T}{T-np}\,\widetilde{\Sigma}_p\right) = \left(\frac{T+np}{T-np}\right)^n \det\widetilde{\Sigma}_p. \qquad (14.9)$$

Taking logs and using the approximations $\frac{T+np}{T-np} \approx 1 + \frac{2np}{T}$ and $\log\left(1 + \frac{2np}{T}\right) \approx \frac{2np}{T}$, we arrive at

$$\mathrm{AIC}(p) \approx \log \mathrm{FPE}(p).$$
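
The intermediate steps of this approximation can be written out as follows. This is a sketch that assumes $np/T$ is small and takes the AIC in the form $\ln\det\widetilde{\Sigma}_p + 2n^2p/T$:

```latex
\begin{align*}
\frac{T+np}{T-np} &= \frac{1 + np/T}{1 - np/T}
  \approx \Big(1 + \frac{np}{T}\Big)^2 \approx 1 + \frac{2np}{T}, \\
\log \mathrm{FPE}(p) &= \log\det\widetilde{\Sigma}_p + n \log\frac{T+np}{T-np} \\
  &\approx \log\det\widetilde{\Sigma}_p + n \log\Big(1 + \frac{2np}{T}\Big) \\
  &\approx \log\det\widetilde{\Sigma}_p + \frac{2n^2p}{T} = \mathrm{AIC}(p).
\end{align*}
```

The factor $n$ in front of the logarithm comes from pulling the scalar $\frac{T+np}{T-np}$ out of the determinant of an $n \times n$ matrix.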


14.4 Example: A VAR Model for the U.S. Economy

In this section, we illustrate how to build and use VAR models for forecasting key macroeconomic variables. For this purpose, we consider the following four variables: GDP per capita ($\{Y_t\}$), price level in terms of the consumer price index (CPI) ($\{P_t\}$), real money stock M1 ($\{M_t\}$), and the three month treasury bill rate ($\{R_t\}$). All variables are for the U.S. and are, with the exception of the interest rate, in logged differences.³ The components of $X_t$ are, with the exception of the interest rate, stationary.⁴ Thus, we aim at modeling $X_t = (\Delta\log Y_t, \Delta\log P_t, \Delta\log M_t, R_t)'$. The sample runs from the first quarter 1959 to the first quarter 2012. We estimate our models, however, only up to the fourth quarter 2008 and reserve the last thirteen quarters, i.e. the period from the first quarter 2009 to the first quarter of 2012, for an out-of-sample evaluation of the forecast performance. This forecast assessment has the advantage of accounting explicitly for the sampling variability in models with estimated parameters.

The  first  step  in  the  modeling  process  is  the  determination  of  the  lag-length. 
Allowing  for  a  maximum  of  twelve  lags,  the  different  information  criteria  produce 
the  values  reported  in  Table  14.1.  Unfortunately,  the  three  criteria  deliver  different 


3 Thus, $\Delta\log P_t$ equals the inflation rate.

4 Although the unit root tests indicate that $R_t$ is integrated of order one, we do not difference this variable. This specification will not affect the consistency of the estimates nor the choice of the lag-length (Sims et al. 1990), but has the advantage that each component of $X_t$ is expressed in percentage points which facilitates the interpretation.
Table 14.1 Information criteria for the VAR models of different orders

Order    AIC         BIC         HQ
0        -14.498     -14.429     -14.470
1        -17.956     -17.611     -17.817
2        -18.638     -18.016*    -18.386
3        -18.741     -17.843     -18.377
4        -18.943     -17.768     -18.467
5        -19.081     -17.630     -18.493*
6        -19.077     -17.349     -18.377
7        -19.076     -17.072     -18.264
8        -19.120*    -16.839     -18.195
9        -18.988     -16.431     -17.952
10       -18.995     -16.162     -17.847
11       -18.900     -15.789     -17.639
12       -18.884     -15.497     -17.512

Minimum of each column marked with *


orders:  AIC  suggests  8  lags,  HQ  5  lags,  and  BIC  2  lags.  In  such  a  situation  it  is 
wise  to  keep  all  three  models  and  to  perform  additional  diagnostic  tests.5  One  such 
test  is  to  run  a  horse-race  between  the  three  models  in  terms  of  their  forecasting 
performance. 
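
The orders selected by the three criteria can be read off Table 14.1 mechanically. The short sketch below copies the table columns into lists (orders 0 to 12) and returns the position of each minimum; the helper name `selected_order` is hypothetical.

```python
# Values copied from Table 14.1, list index = VAR order (0 to 12)
aic = [-14.498, -17.956, -18.638, -18.741, -18.943, -19.081, -19.077,
       -19.076, -19.120, -18.988, -18.995, -18.900, -18.884]
bic = [-14.429, -17.611, -18.016, -17.843, -17.768, -17.630, -17.349,
       -17.072, -16.839, -16.431, -16.162, -15.789, -15.497]
hq = [-14.470, -17.817, -18.386, -18.377, -18.467, -18.493, -18.377,
      -18.264, -18.195, -17.952, -17.847, -17.639, -17.512]

def selected_order(values):
    """Return the order (list index) at which the criterion is smallest."""
    return min(range(len(values)), key=values.__getitem__)

orders = {"AIC": selected_order(aic),
          "BIC": selected_order(bic),
          "HQ": selected_order(hq)}
```

The AIC column is quite flat between orders 5 and 8, which is one reason why it is sensible to keep several candidate models rather than rely on a single criterion.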

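We evaluate the forecasts according to two criteria: the root-mean-squared-error (RMSE) and the mean-absolute-error (MAE)⁶:

$$\mathrm{RMSE}: \quad \sqrt{\frac{1}{h}\sum_{t=T+1}^{T+h}(\widehat{X}_{it} - X_{it})^2} \qquad (14.10)$$

$$\mathrm{MAE}: \quad \frac{1}{h}\sum_{t=T+1}^{T+h}|\widehat{X}_{it} - X_{it}| \qquad (14.11)$$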


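where $\widehat{X}_{it}$ and $X_{it}$ denote the forecast and the actual value of variable i in period t. Forecasts are computed for a horizon h starting in period T. We can gain further insights by decomposing the mean-squared-error additively into three components:

$$\frac{1}{h}\sum_{t=T+1}^{T+h}(\widehat{X}_{it} - X_{it})^2 = \left(\Big(\frac{1}{h}\sum_{t=T+1}^{T+h}\widehat{X}_{it}\Big) - \bar{X}_i\right)^2 + (\sigma_{\widehat{X}_i} - \sigma_{X_i})^2 + 2(1-\rho)\,\sigma_{\widehat{X}_i}\sigma_{X_i}.$$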


5 Such tests would include an analysis of the autocorrelation properties of the residuals and tests of structural breaks.

6 Alternatively one could use the mean-absolute-percentage-error (MAPE). However, as all variables are already in percentages, the MAE is to be preferred.
The first component measures how far the mean of the forecasts $\frac{1}{h}\sum_{t=T+1}^{T+h}\widehat{X}_{it}$ is away from the actual mean of the data $\bar{X}_i$. It therefore measures the bias of the forecasts. The second one compares the standard deviation of the forecasts $\sigma_{\widehat{X}_i}$ to that of the data $\sigma_{X_i}$. Finally, the last component is a measure of the unsystematic forecast errors where $\rho$ denotes the correlation between the forecast and the data. Ideally, each of the three components should be close to zero: there should be no bias, the variation of the forecasts should correspond to that of the data, and the forecasts and the data should be highly positively correlated. In order to avoid scaling problems, all three components are usually expressed as a proportion of $\frac{1}{h}\sum_{t=T+1}^{T+h}(\widehat{X}_{it} - X_{it})^2$:

bias proportion:

$$\frac{\left(\Big(\frac{1}{h}\sum_{t=T+1}^{T+h}\widehat{X}_{it}\Big) - \bar{X}_i\right)^2}{\frac{1}{h}\sum_{t=T+1}^{T+h}(\widehat{X}_{it} - X_{it})^2} \qquad (14.12)$$

variance proportion:

$$\frac{(\sigma_{\widehat{X}_i} - \sigma_{X_i})^2}{\frac{1}{h}\sum_{t=T+1}^{T+h}(\widehat{X}_{it} - X_{it})^2} \qquad (14.13)$$

covariance proportion:

$$\frac{2(1-\rho)\,\sigma_{\widehat{X}_i}\sigma_{X_i}}{\frac{1}{h}\sum_{t=T+1}^{T+h}(\widehat{X}_{it} - X_{it})^2} \qquad (14.14)$$
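
The evaluation measures (14.10) to (14.14) can be computed in a few lines. The sketch below is an illustration with made-up numbers; the function name `forecast_evaluation` is hypothetical. It assumes that standard deviations are computed with divisor h (population form), which is what makes the three proportions add up to one exactly.

```python
import numpy as np

def forecast_evaluation(forecast, actual):
    """RMSE, MAE and the bias, variance and covariance proportions of the MSE."""
    f = np.asarray(forecast, dtype=float)
    a = np.asarray(actual, dtype=float)
    err = f - a
    mse = np.mean(err ** 2)
    sd_f, sd_a = f.std(), a.std()           # population standard deviations (divisor h)
    rho = np.corrcoef(f, a)[0, 1]           # correlation between forecasts and data
    return {
        "RMSE": np.sqrt(mse),
        "MAE": np.mean(np.abs(err)),
        "bias": (f.mean() - a.mean()) ** 2 / mse,
        "variance": (sd_f - sd_a) ** 2 / mse,
        "covariance": 2 * (1 - rho) * sd_f * sd_a / mse,
    }

# Made-up numbers purely for illustration
actual = [1.0, 0.5, -0.2, 0.8, 1.1]
forecast = [1.3, 0.9, 0.4, 0.7, 1.6]
result = forecast_evaluation(forecast, actual)
```

A large value of `result["bias"]` signals that most of the forecast error is systematic, which is the pattern discussed for GDP per capita in this section.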

We use these models to produce dynamic or iterated forecasts $P_T X_{T+1}, P_T X_{T+2}, \ldots, P_T X_{T+h}$. Forecasts for $h \ge 2$ are computed iteratively by inserting for the lagged variables the forecasts obtained in the previous steps. For details see Chap. 14. Alternatively, one may consider a recursive or rolling out-of-sample strategy where the model is reestimated each time a new observation becomes available. Thus, we would evaluate the one-period-ahead forecasts $P_T X_{T+1}, P_{T+1} X_{T+2}, \ldots, P_{T+h-1} X_{T+h}$, the two-period-ahead forecasts $P_T X_{T+2}, P_{T+1} X_{T+3}, \ldots, P_{T+h-2} X_{T+h}$, and so on. The difference between the recursive and the rolling strategy is that in the first case all observations are used for estimation whereas in the second case the sample is rolled over so that its size is kept fixed at T.
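
The difference between the recursive and the rolling strategy is purely a matter of which observations enter each re-estimation. The sketch below lists the estimation samples as index pairs; the function name `estimation_windows` and the 0-based, end-exclusive index convention are assumptions made for this illustration.

```python
def estimation_windows(T, h, scheme="recursive"):
    """Estimation samples for the one-period-ahead forecasts of X_{T+1}, ..., X_{T+h}.

    Returns (start, end) pairs with 0-based positions and exclusive end, so the
    k-th window is used to forecast the observation at position T + k.
    """
    if scheme not in ("recursive", "rolling"):
        raise ValueError("scheme must be 'recursive' or 'rolling'")
    windows = []
    for k in range(h):
        start = k if scheme == "rolling" else 0   # rolling drops the oldest observation
        windows.append((start, T + k))
    return windows

recursive = estimation_windows(T=5, h=3, scheme="recursive")
rolling = estimation_windows(T=5, h=3, scheme="rolling")
```

In the recursive scheme the windows grow by one observation each step, while in the rolling scheme every window contains exactly T observations.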

Figure 14.1 displays dynamic or iterated forecasts for the four variables expressed in log-levels, respectively in levels for the interest rate. Forecasts are evaluated according to the performance measures explained above. The corresponding values are reported in Table 14.2. All models see a quick recovery after the recession in 2008 and are thus much too optimistic. The lowest RMSE for $\log Y_t$ is 5.678 for the VAR(8) model. Thus, GDP per capita is predicted to be on average almost 6 % too high over the forecast period. This overly optimistic forecast is reflected in a large bias proportion which amounts to more than 95 %. The situation looks much better for the price level. All models see an increase in inflation starting in 2009. The two higher order models, in particular, fare much better. Their RMSE is just over 1 %. The bias proportion is practically zero for the VAR(8) model. The forecast results of the real money stock are mixed. All models predict a quick recovery. This indeed took place, but at first at a more moderate pace. Starting in
Fig. 14.1 Forecast comparison of alternative models. (a) $\log Y_t$. (b) $\log P_t$. (c) $\log M_t$. (d) $R_t$


mid-2010  the  unconventional  monetary  policy  of  quantitative  easing,  however,  led 
to  an  unforeseen  acceleration  so  that  the  forecasts  turned  out  to  be  systematically 
too  low  for  the  later  period.  Interestingly,  the  smallest  model  fared  significantly 
better  than  the  other  two.  Finally,  the  results  for  the  interest  rates  are  very  diverse. 
Whereas  the  VAR(2)  model  predicts  a  rise  in  the  interest  rate,  the  other  models 
foresee  a  decline.  The  VAR(8)  model  even  predicts  a  very  drastic  fall.  However, 
all models miss the continuation of the low interest rate regime and forecast an
increase  starting  already  in  2009.  This  error  can  again  be  attributed  to  the  unforeseen 
low  interest  rate  monetary  policy  which  was  implemented  in  conjunction  with  the 
quantitative  easing.  This  misjudgement  resulted  in  a  relatively  large  bias  proportion. 

Up  to  now,  we  have  just  been  concerned  with  point  forecasts.  Point  forecasts, 
however,  describe  only  one  possible  outcome  and  do  not  reflect  the  inherent 
uncertainty  surrounding  the  prediction  problem.  It  is,  thus,  a  question  of  scientific 
integrity  to  present  in  addition  to  the  point  forecasts  also  confidence  intervals.  One 
straightforward  way  to  construct  such  intervals  is  by  computing  the  matrix  of  mean- 
squared-errors  MSE  using  Eq.  (14.5).  The  diagonal  elements  of  this  matrix  can  be 
interpreted  as  a  measure  of  the  forecast  error  variances  for  each  variable.  Under 
the assumption that the innovations $\{Z_t\}$ are Gaussian, such confidence intervals can
be  easily  computed.  However,  in  practice  this  assumption  is  likely  to  be  violated. 
This  problem  can  be  circumvented  by  using  the  empirical  distribution  function  of 
the  residuals  to  implement  a  bootstrap  method  similar  to  the  computation  of  the 
Table 14.2 Forecast evaluation of alternative VAR models

                              VAR(2)    VAR(5)    VAR(8)
log Y_t
  RMSE                        8.387     6.406     5.678
  Bias proportion             0.960     0.961     0.951
  Variance proportion         0.020     0.010     0.001
  Covariance proportion       0.020     0.029     0.048
  MAE                         8.217     6.279     5.536
log P_t
  RMSE                        3.126     1.064     1.234
  Bias proportion             0.826     0.746     0.001
  Variance proportion         0.121     0.001     0.722
  Covariance proportion       0.053     0.253     0.278
  MAE                         2.853     0.934     0.928
log M_t
  RMSE                        5.616     6.780     9.299
  Bias proportion             0.036     0.011     0.002
  Variance proportion         0.499     0.622     0.352
  Covariance proportion       0.466     0.367     0.646
  MAE                         4.895     5.315     7.762
R_t
  RMSE                        2.195     2.204     2.845
  Bias proportion             0.367     0.606     0.404
  Variance proportion         0.042     0.337     0.539
  Covariance proportion       0.022     0.057     0.057
  MAE                         2.125     1.772     2.299

RMSE and MAE for log Y_t, log P_t, and log M_t are multiplied by 100


Value-at-Risk in Sect. 8.4. Figure 14.2 plots the forecasts of the VAR(8) model together with an 80 % confidence interval computed from the bootstrap approach. It shows that, with the exception of the logged price level, the actual realizations fall outside the confidence interval despite the fact that the intervals are already relatively large. This documents the uniqueness of the financial crisis, which poses a hard challenge for any forecasting model.
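
For comparison with the bootstrap intervals, the simpler Gaussian variant based on the diagonal of the MSE matrix can be sketched as follows. This is not the procedure used for Fig. 14.2: the function name `gaussian_forecast_interval` is hypothetical, the point forecast values are made up, the MSE matrix is simply the MSE(3) matrix of the example in Sect. 14.1, and parameter uncertainty is ignored.

```python
from statistics import NormalDist

import numpy as np

def gaussian_forecast_interval(point_forecast, mse, coverage=0.80):
    """Symmetric interval per variable: point forecast +/- z * sqrt(diag(MSE))."""
    z = NormalDist().inv_cdf(0.5 + coverage / 2)    # about 1.2816 for 80 % coverage
    se = np.sqrt(np.diag(mse))                      # forecast standard errors
    point = np.asarray(point_forecast, dtype=float)
    return point - z * se, point + z * se

# Made-up point forecasts combined with the MSE(3) matrix of the VAR(2) example
mse_3 = np.array([[2.2047, 0.3893], [0.3893, 2.9309]])
lower, upper = gaussian_forecast_interval([0.5, 2.0], mse_3)
```

The intervals are marginal, one per variable; a joint statement about several variables would also need the off-diagonal elements of the MSE matrix.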

Instead  of  computing  a  confidence  interval,  one  may  estimate  the  probability 
distribution  of  possible  future  outcomes.  This  provides  a  complete  description  of  the 
uncertainty  related  to  the  prediction  problem  (Christoffersen  1998;  Diebold  et  al. 
1998;  Tay  and  Wallis  2000;  Corradi  and  Swanson  2006).  Finally,  one  should  be 
aware  that  the  innovation  uncertainty  is  not  the  only  source  of  uncertainty.  As  the 
parameters of the model are themselves estimated, there is also a coefficient uncertainty. In addition, we have to face the possibility that the model is misspecified.

The forecasting performance of the VAR models may seem disappointing at first. However, this was only a first attempt and further investigations are usually necessary. These may include the search for structural breaks (see Bai et al. 1998; Perron 2006). This topic is treated in Sect. 18.1. Another reason for the poor forecasting performance may be the over-parametrization of VAR models. The VAR(8) model, for example, has 32 lagged dependent variables plus a constant in each of the four equations, which leads to a total of 132 parameters. This problem can be dealt with by applying Bayesian shrinkage techniques. This approach, also known as Bayesian VAR (BVAR), was particularly successful when using the so-called Minnesota prior (see Doan et al. 1984; Litterman 1986; Kunst and Neusser 1986; Banbura et al. 2010). This prior is presented in Sect. 18.2.

Fig. 14.2 Forecast of VAR(8) model and 80 % confidence intervals (red dotted lines). (a) $\log Y_t$. (b) $\log P_t$. (c) $\log M_t$. (d) $R_t$

Besides these more fundamental issues, one may rely on more technical remedies. One such remedy is the use of direct rather than iterated forecasts. This difference is best explained in the context of the VAR(1) model $X_t = \Phi X_{t-1} + Z_t$, $Z_t \sim \mathrm{WN}(0, \Sigma)$. The iterated forecast for $X_{T+h}$ uses the OLS-estimate $\widehat{\Phi}$ to compute the forecast $\widehat{\Phi}^h X_T$ (see Chap. 14). Alternatively, one may estimate, instead of the VAR(1), the model $X_t = \Upsilon X_{t-h} + Z_t$ and compute the direct forecast for $X_{T+h}$ as $\widehat{\Upsilon} X_T$. Although $\widehat{\Upsilon}$ has larger variance than $\widehat{\Phi}$ if the VAR(1) is correctly specified, it is robust to misspecification (see Bhansali 1999; Schorfheide 2005; Marcellino et al. 2006).
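
The distinction between iterated and direct forecasts can be made concrete with a small simulation. The following is a minimal sketch under stated assumptions: the data are simulated from an assumed VAR(1) without constant, the regressions are estimated by OLS via `numpy.linalg.lstsq`, and all function names are hypothetical.

```python
import numpy as np

def ols_lag_matrix(x, lag):
    """OLS estimate of A in x_t = A x_{t-lag} + error (no constant); x has shape (T, n)."""
    y, z = x[lag:], x[:-lag]
    coef, *_ = np.linalg.lstsq(z, y, rcond=None)    # solves z @ coef ~ y, so coef = A'
    return coef.T

def iterated_forecast(x, h):
    """Estimate the one-step matrix and raise it to the power h."""
    phi_hat = ols_lag_matrix(x, 1)
    return np.linalg.matrix_power(phi_hat, h) @ x[-1]

def direct_forecast(x, h):
    """Regress x_t directly on x_{t-h} and apply the estimate once."""
    upsilon_hat = ols_lag_matrix(x, h)
    return upsilon_hat @ x[-1]

# Simulated data from an assumed VAR(1) (illustrative values only)
rng = np.random.default_rng(1)
phi_true = np.array([[0.6, 0.1], [0.2, 0.3]])
x = np.zeros((500, 2))
for t in range(1, x.shape[0]):
    x[t] = phi_true @ x[t - 1] + rng.standard_normal(2)

iterated_3 = iterated_forecast(x, 3)
direct_3 = direct_forecast(x, 3)
```

For h = 1 both approaches coincide by construction; for longer horizons they differ in finite samples, and the gap tends to widen when the VAR(1) is a poor description of the data.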

Another  interesting  and  common  device  is  intercept  correction  or  residual 
adjustment.  Thereby  the  constant  terms  are  adjusted  in  such  a  way  that  the  residuals 
of  the  most  recent  observation  become  zero.  The  model  is  thereby  set  back  on 
track.  In  this  way  the  forecaster  can  guard  himself  against  possible  structural  breaks. 
Residual  adjustment  can  also  serve  as  a  device  to  incorporate  anticipated  events,  like 
announced  policies,  which  are  not  yet  incorporated  into  the  model.  See  Clements 
and  Hendry  (1996,  2006)  for  further  details  and  additional  forecasting  devices. 
Although the estimation of VAR models poses no difficulties as outlined in the previous chapter, the individual coefficients are almost impossible to interpret. On the one hand, there are usually many coefficients: a VAR(4) model with three variables, for example, already has twelve coefficients per equation and thus 36 coefficients in total to interpret. On the other hand, there is in general no unambiguous relation of the VAR coefficients to the coefficients of a particular model. The last problem is known as the identification problem. To overcome this identification problem, many techniques have been developed which should allow one to give the estimated VAR model an explicit economic interpretation.


15.1  Wiener-Granger  Causality 

As a first technique for the understanding of VAR processes, we analyze the concept of causality which was introduced by Granger (1969). The concept is also known as Wiener-Granger causality because Granger’s idea goes back to the work of Wiener (1956). Take a multivariate time series $\{X_t\}$ and consider the forecast of $X_{1,T+h}$, $h \ge 1$, given $X_T, X_{T-1}, \ldots$ where $\{X_t\}$ has not only $X_{1t}$ as a component, but also another variable or group of variables $X_{2t}$. $X_t$ may contain even further variables than $X_{1t}$ and $X_{2t}$. The mean-squared forecast error is denoted by $\mathrm{MSE}_1(h)$. Consider now an alternative forecast of $X_{1,T+h}$ given $\widetilde{X}_T, \widetilde{X}_{T-1}, \ldots$ where $\{\widetilde{X}_t\}$ is obtained from $\{X_t\}$ by eliminating the component $X_{2t}$. The mean-squared error of this forecast is denoted by $\widetilde{\mathrm{MSE}}_1(h)$. According to Granger, we can say that the second variable $X_{2t}$ causes or is causal for $X_{1t}$ if and only if

$$\mathrm{MSE}_1(h) < \widetilde{\mathrm{MSE}}_1(h) \quad \text{for some } h \ge 1.$$
This means that the information contained in $\{X_{2t}\}$ and its past improves the forecast of $\{X_{1t}\}$ in the sense of the mean-squared forecast error. Thus the concept of Wiener-Granger causality makes only sense for purely non-deterministic processes and rests on two principles¹:

•  The  future  cannot  cause  the  past.  Only  the  past  can  have  a  causal  influence  on 
the  future.2 

•  A  specific  cause  contains  information  not  available  otherwise. 

The  concept  of  Wiener-Granger  causality  played  an  important  role  in  the  debate 
between  monetarists  and  Keynesians  over  the  issue  whether  the  money  stock  has 
an  independent  influence  on  real  activity.  It  turned  out  that  this  question  can  only 
be  resolved  within  a  specific  context.  Sims  (1980a),  for  example,  showed  that  the 
relationship  between  the  growth  rate  of  the  money  stock  and  changes  in  real  activity 
depends  on  whether  a  short  interest  rate  is  accounted  for  in  the  empirical  analysis  or 
not.  Another  problem  of  the  concept  is  that  it  is  not  unambiguously  possible  to  infer 
a  causal  relationship  just  from  the  chronology  of  two  variables  as  demonstrated 
by  Tobin  (1970).  This  and  other  conceptual  issues  (see  Zellner  (1979)  and  the 
discussion  in  the  next  chapter)  and  econometric  problems  (Geweke  1984)  led  to 
a  decline  in  the  practical  importance  of  this  concept. 

We  propose  two  econometric  implementations  of  the  causality  concept.  The 
first  one  is  based  on  a  VAR,  the  second  one  is  non-parametric  and  uses  the 
cross-correlations.  In  addition,  we  propose  an  interpretation  in  terms  of  the  causal 
representation, respectively the Wold Decomposition Theorem (see Chap. 14).

15.1.1  VAR  Approach 

If one restricts oneself to linear least-squares forecasts, the above definition can easily be operationalized in the context of VAR models with only two variables (see also Sims 1972). Consider first a VAR(1) model. Then, according to the explanations in Chap. 14, the one-period forecast is:

$$P_T X_{T+1} = \Phi X_T$$

and therefore

$$P_T X_{1,T+1} = \phi_{11} X_{1T} + \phi_{12} X_{2T}.$$


¹ Compare this to the concept of a causal representation developed in Sects. 12.3 and 2.3.

² Sometimes the concept of contemporaneous causality is also considered. This concept is, however, controversial, has not gained much success in practice, and will therefore not be pursued.

If $\phi_{12} = 0$, then the second variable does not contribute to the one-period forecast of the first variable and can therefore be omitted: $\mathrm{MSE}_1(1) = \widetilde{\mathrm{MSE}}_1(1)$. Note that

$$\Phi^h = \begin{pmatrix} \phi_{11} & 0 \\ \phi_{21} & \phi_{22} \end{pmatrix}^h = \begin{pmatrix} \phi_{11}^h & 0 \\ * & \phi_{22}^h \end{pmatrix}$$

where $*$ is a placeholder for an arbitrary number. Thus the second variable is irrelevant not only for the one-period forecast, but for any forecast horizon $h \geq 1$. Hence, the second variable is not causal for the first variable in the sense of Wiener-Granger causality.

These arguments can easily be extended to VAR(p) models. According to Eq. (14.4) we have that

$$P_T X_{1,T+1} = \phi_{11}^{(1)} X_{1T} + \phi_{12}^{(1)} X_{2T} + \dots + \phi_{11}^{(p)} X_{1,T-p+1} + \phi_{12}^{(p)} X_{2,T-p+1}$$

where $\phi_{ij}^{(k)}$ denotes the $(i,j)$-th element, $i,j = 1,2$, of the matrix $\Phi_k$, $k = 1, \dots, p$. In order for the second variable to have no influence on the forecast of the first one, we must have that $\phi_{12}^{(1)} = \phi_{12}^{(2)} = \dots = \phi_{12}^{(p)} = 0$. This implies that all matrices $\Phi_k$, $k = 1, \dots, p$, must be lower triangular, i.e. they must be of the form

$$\begin{pmatrix} * & 0 \\ * & * \end{pmatrix}.$$

As products and sums of lower triangular matrices are again lower triangular, the second variable is irrelevant in forecasting the first one at any forecast horizon. This can be seen by computing the corresponding forecast function recursively as in Chap. 14.

Based on this insight it is straightforward to test the null hypothesis that the second variable does not cause the first one within the VAR(p) context:

$$H_0: \{X_{2t}\} \text{ does not cause } \{X_{1t}\}.$$

In terms of the VAR model this hypothesis can be stated as:

$$H_0: \phi_{12}^{(1)} = \phi_{12}^{(2)} = \dots = \phi_{12}^{(p)} = 0.$$

The alternative hypothesis is that the null hypothesis is violated. As least-squares estimation leads, under quite general conditions, to asymptotically normally distributed coefficient estimates, it is straightforward to test the above hypothesis by a Wald test (F-test). In the context of a VAR(1) model a simple t-test is also possible.
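The t-test for the VAR(1) case can be illustrated with a small Python sketch. It is only a sketch under stated assumptions, not the book's implementation: the function names `simulate_var1` and `granger_t_stat`, the coefficient values, the standard normal shocks, and the omission of an intercept (the simulated series have mean zero) are all choices made here for illustration.

```python
import math
import random

def simulate_var1(phi, T, seed=0):
    """Simulate a bivariate VAR(1) X_t = Phi X_{t-1} + Z_t with N(0, I) shocks."""
    rng = random.Random(seed)
    x1, x2 = [0.0], [0.0]
    for _ in range(T):
        z1, z2 = rng.gauss(0, 1), rng.gauss(0, 1)
        new1 = phi[0][0] * x1[-1] + phi[0][1] * x2[-1] + z1
        new2 = phi[1][0] * x1[-1] + phi[1][1] * x2[-1] + z2
        x1.append(new1)
        x2.append(new2)
    return x1, x2

def granger_t_stat(x1, x2):
    """t-statistic for H0: phi_12 = 0 in the least-squares regression
    X_{1,t} = phi_11 X_{1,t-1} + phi_12 X_{2,t-1} + Z_{1,t} (no intercept)."""
    y, a, b = x1[1:], x1[:-1], x2[:-1]   # dependent variable, own lag, other lag
    n = len(y)
    saa = sum(u * u for u in a)
    sbb = sum(u * u for u in b)
    sab = sum(u * v for u, v in zip(a, b))
    say = sum(u * v for u, v in zip(a, y))
    sby = sum(u * v for u, v in zip(b, y))
    det = saa * sbb - sab * sab          # determinant of X'X
    phi11 = (sbb * say - sab * sby) / det
    phi12 = (saa * sby - sab * say) / det
    rss = sum((yi - phi11 * ai - phi12 * bi) ** 2 for yi, ai, bi in zip(y, a, b))
    s2 = rss / (n - 2)                   # residual variance estimate
    se12 = math.sqrt(s2 * saa / det)     # saa / det is the (2,2) element of (X'X)^{-1}
    return phi12 / se12

# Hypothetical coefficient matrices, chosen only for illustration
phi_causal = [[0.5, 0.4], [0.0, 0.5]]      # phi_12 != 0: X2 causes X1
phi_noncausal = [[0.5, 0.0], [0.4, 0.5]]   # phi_12 == 0: X2 does not cause X1
t_causal = granger_t_stat(*simulate_var1(phi_causal, 2000, seed=1))
t_noncausal = granger_t_stat(*simulate_var1(phi_noncausal, 2000, seed=1))
print(t_causal, t_noncausal)
```

With more than one lag the same regression would include all $p$ lags of both variables, and the joint restriction would be checked by a Wald or F-test instead of a single t-ratio.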

If more than two variables are involved, the concept of Wiener-Granger causality is no longer so easy to implement. Consider for expositional purposes a VAR(1) model in three variables with coefficient matrix:

$$\Phi = \begin{pmatrix} \phi_{11} & \phi_{12} & 0 \\ \phi_{21} & \phi_{22} & \phi_{23} \\ \phi_{31} & \phi_{32} & \phi_{33} \end{pmatrix}.$$

The one-period forecast function of the first variable then is

$$P_T X_{1,T+1} = \phi_{11} X_{1T} + \phi_{12} X_{2T}.$$

Thus, the third variable $X_{3T}$ is irrelevant for the one-period forecast of the first variable. However, as the third variable has an influence on the second variable, $\phi_{23} \neq 0$, and because the second variable has an influence on the first variable, $\phi_{12} \neq 0$, the third variable will indirectly provide useful information for the forecast of the first variable for forecasting horizons $h \geq 2$. Consequently, the concept of causality cannot immediately be extended from two to more than two variables.
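The indirect channel can be made concrete with a short Python sketch. It uses the fact that iterating the one-period forecast of a VAR(1) gives $P_T X_{T+h} = \Phi^h X_T$, so the weight of $X_{3T}$ in the $h$-period forecast of the first variable is the $(1,3)$ element of $\Phi^h$. The numerical entries of `phi` and the helper names `matmul` and `matpow` are hypothetical and serve only as an illustration.

```python
def matmul(a, b):
    """Product of two square matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def matpow(a, h):
    """h-th power of a square matrix by repeated multiplication."""
    n = len(a)
    result = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for _ in range(h):
        result = matmul(result, a)
    return result

# Hypothetical coefficients with phi_13 = 0, phi_12 != 0 and phi_23 != 0
phi = [[0.5, 0.3, 0.0],
       [0.1, 0.4, 0.2],
       [0.0, 0.1, 0.6]]

one_step = matpow(phi, 1)[0][2]   # weight of X_{3T} in the one-period forecast of X_1
two_step = matpow(phi, 2)[0][2]   # weight of X_{3T} in the two-period forecast of X_1
print(one_step, two_step)
```

The two-period weight equals $\phi_{12}\phi_{23}$ here, which is nonzero although the one-period weight vanishes.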

It is, however, possible to merge variables one and two, or variables two and three, into groups and discuss the hypothesis that the third variable does not cause the first two variables, seen as a group; likewise that the second and third variables, seen as a group, do not cause the first variable. The corresponding null hypotheses then are:

$$H_0: \phi_{13} = \phi_{23} = 0 \quad \text{or} \quad H_0: \phi_{12} = \phi_{13} = 0.$$

Under these null hypotheses we again get lower (block-) triangular matrices:

$$\left(\begin{array}{cc|c} \phi_{11} & \phi_{12} & 0 \\ \phi_{21} & \phi_{22} & 0 \\ \phi_{31} & \phi_{32} & \phi_{33} \end{array}\right) \quad \text{or} \quad \left(\begin{array}{c|cc} \phi_{11} & 0 & 0 \\ \phi_{21} & \phi_{22} & \phi_{23} \\ \phi_{31} & \phi_{32} & \phi_{33} \end{array}\right).$$

Each of these hypotheses can again be checked by a Wald test (F-test).


15.1.2  Wiener-Granger Causality and Causal Representation

We can get further insights into the concept of causality by considering a bivariate VAR, $\Phi(L)X_t = Z_t$, with causal representation $X_t = \Psi(L)Z_t$. Partitioning the matrices according to the two variables $\{X_{1t}\}$ and $\{X_{2t}\}$, Theorem 12.1 of Sect. 12.3 implies that

$$\begin{pmatrix} \Phi_{11}(z) & \Phi_{12}(z) \\ \Phi_{21}(z) & \Phi_{22}(z) \end{pmatrix} \begin{pmatrix} \Psi_{11}(z) & \Psi_{12}(z) \\ \Psi_{21}(z) & \Psi_{22}(z) \end{pmatrix} = \begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix}$$

where the polynomials $\Phi_{12}(z)$ and $\Phi_{21}(z)$ have no constant terms. The hypothesis that the second variable does not cause the first one is equivalent in this framework to the hypothesis that $\Phi_{12}(z) = 0$. Multiplying out the above expression, the $(1,2)$ element reads $\Phi_{11}(z)\Psi_{12}(z) + \Phi_{12}(z)\Psi_{22}(z) = 0$, which under this hypothesis leads to the condition

$$\Phi_{11}(z)\Psi_{12}(z) = 0.$$

Because $\Phi_{11}(z)$ involves a constant term, the above equation implies that $\Psi_{12}(z) = 0$. Thus the causal representation is lower triangular. This means that the first variable is composed of the first shock $\{Z_{1t}\}$ only, whereas the second variable involves both shocks $\{Z_{1t}\}$ and $\{Z_{2t}\}$. The univariate causal representation of $\{X_{1t}\}$ is therefore the same as the bivariate one.³ Finally, note the similarity to the issue of the identification of shocks discussed in subsequent sections.


15.1.3  Cross-Correlation  Approach 

In the case of two variables we can also examine the cross-correlations to test for causality. This non-parametric test has the advantage that one does not have to rely on an explicit VAR model. This advantage becomes particularly relevant if a VMA model must be approximated by a high-order AR model. Consider the cross-correlations

$$\rho_{12}(h) = \operatorname{corr}(X_{1t}, X_{2,t-h}).$$

If $\rho_{12}(h) \neq 0$ for some $h > 0$, we can say that the past values of the second variable are useful for forecasting the first variable, so that the second variable causes the first one in the sense of Wiener and Granger. Another terminology is that the second variable is a leading indicator for the first one. If, in addition, $\rho_{12}(h) \neq 0$ for some $h < 0$, so that the past values of the first variable help to forecast the second one, we have causality in both directions.

As the distribution of the cross-correlations of two independent variables depends on the autocorrelation of each variable (see Theorem 11.4), Haugh (1976) and Pierce and Haugh (1977) propose a test based on the filtered time series. Analogously to the test for independence (see Sect. 11.2), we proceed in two steps:

Step 1: Estimate a univariate AR(p) model for each of the two time series $\{X_{1t}\}$ and $\{X_{2t}\}$. Choose $p$ such that the corresponding residuals $\{Z_{1t}\}$ and $\{Z_{2t}\}$ are white noise. Note that although $\{Z_{1t}\}$ and $\{Z_{2t}\}$ are both not autocorrelated, the cross-correlations $\rho_{Z_1,Z_2}(h)$ may still be non-zero for arbitrary orders $h$.


³ As we are working with causal VARs, the above arguments also hold with respect to the Wold Decomposition.

Step 2: As $\{Z_{1t}\}$ and $\{Z_{2t}\}$ are the forecast errors based on forecasts which rely only on the own past, the concept of causality carries over from the original variables to the residuals. The null hypothesis that the second variable does not cause the first variable in the sense of Wiener and Granger can then be checked by the Haugh-Pierce statistic:

$$\text{Haugh-Pierce statistic:} \quad T \sum_{h=1}^{L} \hat{\rho}^2_{Z_1,Z_2}(h) \sim \chi^2_L. \qquad (15.1)$$

Thereby $\hat{\rho}^2_{Z_1,Z_2}(h)$, $h = 1, 2, \dots, L$, denotes the squared estimated cross-correlation coefficients between $\{Z_{1t}\}$ and $\{Z_{2t}\}$. Under the null hypothesis that the second variable does not cause the first one, this test statistic is asymptotically distributed as a $\chi^2$ distribution with $L$ degrees of freedom.
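A minimal Python sketch of Step 2 may help to fix ideas. It assumes that the residual series from Step 1 are already available; here two simulated white-noise series stand in for them. The function names `cross_corr` and `haugh_pierce`, the sample size, and the choice $L = 5$ are illustrative assumptions and not part of the original exposition.

```python
import math
import random

def cross_corr(z1, z2, h):
    """Sample cross-correlation corr(Z_{1,t}, Z_{2,t-h}) for h >= 0."""
    T = len(z1)
    m1 = sum(z1) / T
    m2 = sum(z2) / T
    s1 = math.sqrt(sum((u - m1) ** 2 for u in z1) / T)
    s2 = math.sqrt(sum((u - m2) ** 2 for u in z2) / T)
    cov = sum((z1[t] - m1) * (z2[t - h] - m2) for t in range(h, T)) / T
    return cov / (s1 * s2)

def haugh_pierce(z1, z2, L):
    """T times the sum of squared cross-correlations at lags 1, ..., L (Eq. 15.1)."""
    T = len(z1)
    return T * sum(cross_corr(z1, z2, h) ** 2 for h in range(1, L + 1))

# Simulated stand-ins for the Step 1 residuals (hypothetical data)
rng = random.Random(0)
T = 500
z2 = [rng.gauss(0, 1) for _ in range(T)]
noise = [rng.gauss(0, 1) for _ in range(T)]
# z1_caused depends on the previous value of z2, so z2 leads z1_caused
z1_caused = [noise[t] + (z2[t - 1] if t > 0 else 0.0) for t in range(T)]

stat_caused = haugh_pierce(z1_caused, z2, L=5)
stat_indep = haugh_pierce(noise, z2, L=5)
print(stat_caused, stat_indep)
```

The statistics would be compared with a critical value of the $\chi^2_L$ distribution; for $L = 5$ the 5 % critical value is about 11.07.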


15.2  Structural and Reduced Form
15.2.1  A  Prototypical  Example 

The discussion in the previous section showed that the relation between VAR models and economic models is ambiguous. In order to better understand the quintessence of the problem, we first analyze a simple macroeconomic example. Let $\{y_t\}$ and $\{m_t\}$ denote the output and the money supply of an economy⁴ and suppose that the relation between the two variables is represented by the following simultaneous equation system:

AD-curve:  $X_{1t} = y_t = a_1 m_t + \gamma_{11} y_{t-1} + \gamma_{12} m_{t-1} + v_{yt}$

policy reaction curve:  $X_{2t} = m_t = a_2 y_t + \gamma_{21} y_{t-1} + \gamma_{22} m_{t-1} + v_{mt}$

Note the contemporaneous dependence of $y_t$ on $m_t$ in the AD-curve and the corresponding dependence of $m_t$ on $y_t$ in the policy reaction curve. These equations are typically derived from economic reasoning and may characterize a model explicitly derived from economic theory. In statistical terms the simultaneous equation system is called the structural form. The error terms $\{v_{yt}\}$ and $\{v_{mt}\}$ are interpreted as demand shocks and money supply shocks, respectively. They are called structural shocks and are assumed to follow a multivariate white noise process:

$$V_t = \begin{pmatrix} v_{yt} \\ v_{mt} \end{pmatrix} \sim \mathrm{WN}(0, \Omega) \quad \text{with} \quad \Omega = \begin{pmatrix} \omega_y^2 & 0 \\ 0 & \omega_m^2 \end{pmatrix}.$$

⁴ If one is working with actual data, the variables are usually expressed in log-differences to achieve stationarity.

Note that the two structural shocks are assumed to be contemporaneously uncorrelated, which is reflected in the assumption that $\Omega$ is a diagonal matrix. This assumption is uncontroversial in the literature. Otherwise, there would remain some unexplained relationship between them. The structural shocks can be interpreted as the statistical analog of an experiment in the natural sciences. The experiment corresponds in this case to a shift of the AD-curve due to, for example, a temporary non-anticipated change in government expenditures or money supply. The goal of the analysis is then to trace the reaction of the economy, in our case represented by the two variables $\{y_t\}$ and $\{m_t\}$, to these isolated and autonomous changes in aggregate demand and money supply. The structural equations imply that the reaction is not restricted to contemporaneous effects, but is spread out over time. We thus represent this reaction by the impulse response function.

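We can write the system more compactly in matrix notation:

$$\begin{pmatrix} 1 & -a_1 \\ -a_2 & 1 \end{pmatrix} \begin{pmatrix} y_t \\ m_t \end{pmatrix} = \begin{pmatrix} \gamma_{11} & \gamma_{12} \\ \gamma_{21} & \gamma_{22} \end{pmatrix} \begin{pmatrix} y_{t-1} \\ m_{t-1} \end{pmatrix} + \begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix} \begin{pmatrix} v_{yt} \\ v_{mt} \end{pmatrix}$$

or

$$A X_t = \Gamma X_{t-1} + B V_t$$

where

$$A = \begin{pmatrix} 1 & -a_1 \\ -a_2 & 1 \end{pmatrix}, \quad \Gamma = \begin{pmatrix} \gamma_{11} & \gamma_{12} \\ \gamma_{21} & \gamma_{22} \end{pmatrix} \quad \text{and} \quad B = \begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix}.$$

Assuming that $a_1 a_2 \neq 1$, we can solve the above simultaneous equation system for the two endogenous variables $y_t$ and $m_t$ to get the reduced form of the model:

$$\begin{aligned}
X_{1t} = y_t &= \frac{\gamma_{11} + a_1 \gamma_{21}}{1 - a_1 a_2}\, y_{t-1} + \frac{\gamma_{12} + a_1 \gamma_{22}}{1 - a_1 a_2}\, m_{t-1} + \frac{v_{yt}}{1 - a_1 a_2} + \frac{a_1 v_{mt}}{1 - a_1 a_2} \\
&= \phi_{11} y_{t-1} + \phi_{12} m_{t-1} + Z_{1t} \\
X_{2t} = m_t &= \frac{\gamma_{21} + a_2 \gamma_{11}}{1 - a_1 a_2}\, y_{t-1} + \frac{\gamma_{22} + a_2 \gamma_{12}}{1 - a_1 a_2}\, m_{t-1} + \frac{a_2 v_{yt}}{1 - a_1 a_2} + \frac{v_{mt}}{1 - a_1 a_2} \\
&= \phi_{21} y_{t-1} + \phi_{22} m_{t-1} + Z_{2t}.
\end{aligned}$$

Thus, the reduced form has the structure of a VAR(1) model with error term $\{Z_t\} = \{(Z_{1t}, Z_{2t})'\}$. The reduced form can also be expressed in matrix notation as:

$$X_t = A^{-1}\Gamma X_{t-1} + A^{-1}BV_t = \Phi X_{t-1} + Z_t$$

where

$$Z_t \sim \mathrm{WN}(0, \Sigma) \quad \text{with} \quad \Sigma = A^{-1} B \Omega B' A'^{-1}.$$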

Whereas the structural form represents the inner economic relations between the variables (economic model), the reduced form given by the VAR model summarizes their outer directly observable characteristics. As there is no unambiguous relation between the reduced and structural form, it is impossible to infer the inner economic relationships from the observations alone. This is known in statistics as the identification problem. Typically, a whole family of structural models is compatible with a particular reduced form. The models of the family are thus observationally equivalent to each other as they imply the same distribution for $\{X_t\}$. The identification problem can be overcome if one is willing to make additional a priori assumptions. The nature and the type of these assumptions and their interpretation are the subject of the rest of this chapter.

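In our example, the parameters characterizing the structural and the reduced form are

$$\{a_1, a_2, \gamma_{11}, \gamma_{12}, \gamma_{21}, \gamma_{22}, \omega_y^2, \omega_m^2\}$$

and

$$\{\phi_{11}, \phi_{12}, \phi_{21}, \phi_{22}, \sigma_1^2, \sigma_{12}, \sigma_2^2\}.$$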

As there are eight parameters in the structural form, but only seven parameters in the reduced form, there is no one-to-one relation between structural and reduced form. The VAR(1) model delivers estimates for the seven reduced form parameters, but there is no way to infer from these estimates the parameters of the structural form. Thus, there is a fundamental identification problem.

The simple counting of the number of parameters in each form tells us that we need at least one additional restriction on the parameters of the structural form. The simplest restriction is a zero restriction. Suppose that $a_2$ equals zero, i.e. that the central bank does not react immediately to current output. This seems reasonable because national accounting figures are usually released with some delay. With this assumption, we can infer the structural parameters from the reduced ones:

$$\gamma_{21} = \phi_{21}, \quad \gamma_{22} = \phi_{22},$$

$$v_{mt} = Z_{2t} \;\Rightarrow\; \omega_m^2 = \sigma_2^2 \;\Rightarrow\; a_1 = \sigma_{12}/\sigma_2^2 \;\Rightarrow\; \omega_y^2 = \sigma_1^2 - \sigma_{12}^2/\sigma_2^2,$$

$$\gamma_{11} = \phi_{11} - (\sigma_{12}/\sigma_2^2)\,\phi_{21}, \quad \gamma_{12} = \phi_{12} - (\sigma_{12}/\sigma_2^2)\,\phi_{22}.$$
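The mapping between the two forms and its inversion under $a_2 = 0$ can be checked numerically. The following Python sketch is an illustration under stated assumptions: the function names `reduced_form` and `recover_structural` and all parameter values are hypothetical, and $B = I_2$ as in the example. It also shows the identification problem at work: a structural model with $a_2 \neq 0$ and the model recovered under the restriction $a_2 = 0$ imply the same reduced form.

```python
def reduced_form(a1, a2, gamma, omega_y2, omega_m2):
    """Map structural parameters (with B = I_2) to the reduced form (Phi, Sigma)."""
    d = 1.0 - a1 * a2                      # must be nonzero
    a_inv = [[1.0 / d, a1 / d], [a2 / d, 1.0 / d]]
    phi = [[sum(a_inv[i][k] * gamma[k][j] for k in range(2)) for j in range(2)]
           for i in range(2)]
    # Sigma = A^{-1} Omega A'^{-1} with Omega = diag(omega_y2, omega_m2)
    s11 = (omega_y2 + a1 ** 2 * omega_m2) / d ** 2
    s12 = (a2 * omega_y2 + a1 * omega_m2) / d ** 2
    s22 = (a2 ** 2 * omega_y2 + omega_m2) / d ** 2
    return phi, (s11, s12, s22)

def recover_structural(phi, sigma):
    """Invert the mapping under the identifying restriction a2 = 0."""
    s11, s12, s22 = sigma
    a1 = s12 / s22
    omega_m2 = s22
    omega_y2 = s11 - s12 ** 2 / s22
    gamma = [[phi[0][0] - a1 * phi[1][0], phi[0][1] - a1 * phi[1][1]],
             [phi[1][0], phi[1][1]]]
    return a1, gamma, omega_y2, omega_m2

gamma_true = [[0.6, 0.2], [0.1, 0.7]]      # hypothetical lag coefficients

# Case 1: the restriction a2 = 0 holds, so the structural parameters are recovered
phi0, sigma0 = reduced_form(0.5, 0.0, gamma_true, 1.0, 0.25)
a1_0, gamma_0, wy_0, wm_0 = recover_structural(phi0, sigma0)

# Case 2: a2 = 0.3 in truth; the recovered model differs but is observationally equivalent
phi1, sigma1 = reduced_form(0.5, 0.3, gamma_true, 1.0, 0.25)
a1_1, gamma_1, wy_1, wm_1 = recover_structural(phi1, sigma1)
phi1_r, sigma1_r = reduced_form(a1_1, 0.0, gamma_1, wy_1, wm_1)
print(a1_0, a1_1)
```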

Remark 15.1.  Note that, because $Z_t = A^{-1}BV_t$, the reduced form disturbances $Z_t$ are a linear combination of the structural disturbances, in our case the demand disturbance $v_{yt}$ and the money supply disturbance $v_{mt}$. In each period $t$ the endogenous variables output $y_t$ and money supply $m_t$ are therefore hit simultaneously by both shocks. It is thus not possible without further assumptions to assign the movements in $Z_t$, and consequently in $X_t$, to corresponding changes in the fundamental structural shocks $v_{yt}$ and $v_{mt}$.


Remark 15.2.  As Cooley and LeRoy (1985) already pointed out, the statement “money supply is not causal in the sense of Wiener and Granger for real economic activity”, which in our example is equivalent to $\phi_{12} = 0$, is not equivalent to the statement “money supply does not influence real economic activity”, because $\phi_{12}$ can be zero without $a_1$ being zero. Thus, the notion of causality is not very meaningful in inferring the inner (structural) relationships between variables.


15.2.2  Identification:  The  General  Case

We now present the general identification problem in the context of VAR.⁵ The starting point of the analysis consists of a linear model, derived ideally from economic theory, in its structural form:

$$A X_t = \Gamma_1 X_{t-1} + \dots + \Gamma_p X_{t-p} + B V_t \qquad (15.2)$$

where $V_t$ are the structural disturbances. These disturbances usually have an economic interpretation, for example as demand or supply shocks. $A$ is an $n \times n$ matrix which is normalized such that the diagonal consists of ones only. The matrix $B$ is also normalized such that its diagonal contains only ones. The process of structural disturbances, $\{V_t\}$, is assumed to be a multivariate white noise process with a diagonal covariance matrix $\Omega$:

$$V_t \sim \mathrm{WN}(0, \Omega) \quad \text{with} \quad \Omega = \mathbb{E} V_t V_t' = \begin{pmatrix} \omega_1^2 & 0 & \dots & 0 \\ 0 & \omega_2^2 & \dots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \dots & \omega_n^2 \end{pmatrix}.$$

The assumption that the structural disturbances are uncorrelated with each other is not viewed as controversial, as otherwise there would be unexplained relationships between them. In the literature one encounters an alternative, completely equivalent normalization which leaves the coefficients in $B$ unrestricted but assumes the covariance matrix of $V_t$, $\Omega$, to be equal to the identity matrix $I_n$.

The reduced form is obtained by solving the equation system with respect to $X_t$. Assuming that $A$ is nonsingular, the premultiplication of Eq. (15.2) by $A^{-1}$ leads to the reduced form which corresponds to a VAR(p) model:

$$\begin{aligned} X_t &= A^{-1}\Gamma_1 X_{t-1} + \dots + A^{-1}\Gamma_p X_{t-p} + A^{-1}BV_t \\ &= \Phi_1 X_{t-1} + \dots + \Phi_p X_{t-p} + Z_t. \end{aligned} \qquad (15.3)$$

⁵ A thorough treatment of the identification problem in econometrics can be found in Rothenberg (1971), and for the VAR context in Rubio-Ramírez et al. (2010).


The relation between the structural disturbances $V_t$ and the reduced form disturbances $Z_t$ is in the form of a simultaneous equation system:

$$A Z_t = B V_t. \qquad (15.4)$$

While the structural disturbances are not directly observed, the reduced form disturbances are given as the residuals of the VAR and can thus be considered as given. The relation between the lagged variables is simply

$$\Gamma_j = A \Phi_j, \qquad j = 1, \dots, p.$$

Consequently, once $A$ and $B$ have been identified, not only the coefficients of the lagged variables in the structural form are identified, but also the impulse response functions (see Sect. 15.4.1). We can therefore concentrate our analysis of the identification problem on Eq. (15.4).

With these preliminaries, it is now possible to state the identification problem more precisely. Equation (15.4) shows that the structural form is completely determined by the parameters $(A, B, \Omega)$. Taking the normalization of $A$ and $B$ into account, these parameters can be viewed as points in $\mathbb{R}^{n(2n-1)}$. These parameters determine the distribution of $Z_t = A^{-1}BV_t$ which is completely characterized by the covariance matrix of $Z_t$, $\Sigma$, as the mean is equal to zero.⁶ Thus, the parameters of the reduced form, i.e. the independent elements of $\Sigma$ taking the symmetry into account, are points in $\mathbb{R}^{n(n+1)/2}$. The relation between structural and reduced form can therefore be described by a function $g: \mathbb{R}^{n(2n-1)} \to \mathbb{R}^{n(n+1)/2}$:

$$\Sigma = g(A, B, \Omega) = A^{-1} B \Omega B' A'^{-1}. \qquad (15.5)$$

Ideally, one would want to find the inverse of this function and retrieve, in this way, the structural parameters $(A, B, \Omega)$ from $\Sigma$. This is, however, in general not possible because the dimension of the domain space of $g$, $n(2n-1)$, is strictly greater, for $n \geq 2$, than the dimension of its range space, $n(n+1)/2$. This discrepancy between the dimensions of the domain and the range space of $g$ is known as the identification problem. To put it in another way, there are only $n(n+1)/2$ (nonlinear) equations for $n(2n-1)$ unknowns.⁷


⁶ As usual, we concentrate on the first two moments only.

⁷ Note also that our discussion of the identification problem focuses on local identification, i.e. the invertibility of $g$ in an open neighborhood of $\Sigma$. See Rothenberg (1971) and Rubio-Ramírez et al. (2010) for details on the distinction between local and global identification.

To overcome the identification problem, we have to bring in additional information. A customary approach is to impose a priori assumptions on the structural parameters. The Implicit Function Theorem tells us that we need

$$3n(n-1)/2 = n(2n-1) - n(n+1)/2 \qquad (15.6)$$

such restrictions, so-called identifying restrictions, to be able to invert the function $g$. Note that this is only a necessary condition and that the identification problem becomes more severe as the dimension of the VAR increases because the number of restrictions grows at a rate proportional to $n^2$.
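The counting argument behind Eq. (15.6) is easy to reproduce. The short Python sketch below tabulates the dimensions for a few system sizes; the function names are hypothetical and chosen only for this illustration.

```python
def structural_dim(n):
    """Free parameters in (A, B, Omega): n(n-1) in A, n(n-1) in B, n in Omega."""
    return n * (2 * n - 1)

def reduced_dim(n):
    """Distinct elements of the symmetric n x n covariance matrix Sigma."""
    return n * (n + 1) // 2

def restrictions_needed(n):
    """Number of identifying restrictions according to Eq. (15.6)."""
    return structural_dim(n) - reduced_dim(n)

for n in range(1, 7):
    print(n, structural_dim(n), reduced_dim(n), restrictions_needed(n))
```

For $n = 2$ this gives three restrictions and for $n = 5$ already thirty, in line with the quadratic growth noted above.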

This result can also be obtained by noting that the function $g$ in Eq. (15.5) is invariant under the following transformation $h$:

$$h: (A, B, \Omega) \longrightarrow \left(RA,\; RB\Omega^{1/2}Q\Omega^{-1/2}D^{-1},\; D\Omega D'\right)$$

where $R$, $Q$ and $D$ are arbitrary invertible matrices such that $R$ respects the normalization of $A$ and $B$, $Q$ is an orthogonal matrix, and $D$ is a diagonal matrix. It can be verified that

$$(g \circ h)(A, B, \Omega) = g(A, B, \Omega).$$

The dimensions of the matrices $R$, $Q$, and $D$ are $n^2 - 2n$, $n(n-1)/2$, and $n$, respectively. Summing up gives $3n(n-1)/2 = n^2 - 2n + n(n-1)/2 + n$ degrees of freedom as before.⁸

The  empirical  economics  literature  proposed  several  alternative  identification 
schemes: 

(i)  Short-run  restrictions  place  restrictions,  usually  zero  restrictions,  on  the 
immediate  impact  of  structural  shocks  (among  many  others,  Sims  1980b; 
Blanchard  1989;  Blanchard  and  Watson  1986;  Christiano  et  al.  1999).  See 
Sect.  15.3  for  details. 

(ii) Long-run restrictions place restrictions, usually zero restrictions, on the long-run impact of structural shocks (Blanchard and Quah 1989; Galí 1992). See Sect. 15.5 for details.

(iii) Maximization of the contribution to the forecast error variance of some variable at some horizon with respect to a particular shock (Faust 1998; Uhlig 2004; Francis et al. 2014; Uhlig, 2003, What drives real GNP? unpublished). This method has seen an interesting application in the identification of news shocks (see Barsky and Sims 2011). Further details will be discussed in Sect. 15.4.2.

(iv) Sign restrictions restrict the set of possible impulse response functions (see Sect. 15.4.1) to follow a given sign pattern (Faust 1998; Uhlig 2005; Fry and Pagan 2011; Kilian and Murphy 2012; Rubio-Ramírez et al. 2010; Arias et al. 2014; Baumeister and Hamilton 2015). This approach is complementary to the two previous identification schemes and will be discussed in Sect. 15.6.

⁸ See Neusser (2016) for further implications of viewing the identification problem from an invariance perspective.

(v) Identification through heteroskedasticity (Rigobon 2003).

(vi)  Restrictions  derived  from  a  dynamic  stochastic  general  equilibrium  (DSGE) 
model.  These  restrictions  often  come  as  nonlinear  cross-equation  restrictions 
and  are  viewed  as  the  hallmark  of  rational  expectations  models  (Hansen  and 
Sargent  1980).  Typically,  the  identification  issue  is  overcome  by  imposing  a 
priori  restrictions  via  a  Bayesian  approach  (Negro  and  Schorfheide  (2004) 
among  many  others). 

(vii) Identification using information on global versus idiosyncratic shocks in the context of multi-country or multi-region VAR models (Canova and Ciccarelli 2008; Dees et al. 2007).

(viii)  Instead  of  identifying  all  parameters,  researchers  may  be  interested  in  iden¬ 
tifying  only  one  equation  or  a  subset  of  equations.  This  case  is  known  as 
partial  identification.  The  schemes  presented  above  can  be  extended  in  a 
straightforward  manner  to  the  partial  identification  case. 

These  schemes  are  not  mutually  exclusive,  but  can  be  combined  with  each  other. 
In  the  following  we  will  only  cover  the  identification  through  short-  and  long- 
run  restrictions,  because  these  are  by  far  the  most  popular  ones.  The  economic 
importance  of  these  restrictions  for  the  analysis  of  monetary  policy  has  been 
emphasized  by  Christiano  et  al.  (1999). 


15.2.3  Identification:  The  Case  n  =  2 

Before proceeding further, it is instructive to analyze the case $n = 2$ in more detail.⁹ Assume for simplicity $A = I_2$, then the equation system (15.5) can be written explicitly as

$$\begin{aligned} \sigma_1^2 &= \omega_1^2 + (B)_{12}^2\, \omega_2^2 \\ \sigma_{12} &= (B)_{21}\, \omega_1^2 + (B)_{12}\, \omega_2^2 \\ \sigma_2^2 &= (B)_{21}^2\, \omega_1^2 + \omega_2^2 \end{aligned}$$

with unknowns $(B)_{12}$, $(B)_{21}$, $\omega_1^2$, and $\omega_2^2$. Note that the assumption of $\Sigma$ being positive definite implies that $(B)_{12}(B)_{21} \neq 1$. Thus, we can solve out $\omega_1^2$ and $\omega_2^2$ and reduce the three equations to only one:

$$\left((B)_{12} - b_{21}^{-1}\right)\left((B)_{21} - b_{12}^{-1}\right) = r_{12}^{-2} - 1 \qquad (15.7)$$

where $b_{21}$ and $b_{12}$ denote the least-squares regression coefficients of $Z_{2t}$ on $Z_{1t}$, respectively of $Z_{1t}$ on $Z_{2t}$, i.e. $b_{21} = \sigma_{12}/\sigma_1^2$ and $b_{12} = \sigma_{12}/\sigma_2^2$. $r_{12}$ is the correlation coefficient between $Z_{2t}$ and $Z_{1t}$, i.e. $r_{12} = \sigma_{12}/(\sigma_1 \sigma_2)$. Note that imposing a zero restriction by setting $(B)_{12}$, for example, equal to zero implies that $(B)_{21}$ equals $b_{21}$; and vice versa, setting $(B)_{21} = 0$ implies $(B)_{12} = b_{12}$. As a final remark, the right-hand side of Eq. (15.7) is always positive as the inverse of the squared correlation coefficient is bigger than one. This implies that both product terms must be of the same sign.

⁹ The exposition is inspired by Leamer (1981).

Fig. 15.1  Identification in a two-dimensional structural VAR with $\sigma_{12} > 0$

Equation (15.7) delineates all possible combinations of $(B)_{12}$ and $(B)_{21}$ which are compatible with a given covariance matrix $\Sigma$. Its graph represents a rectangular hyperbola in the parameter space $((B)_{12}, (B)_{21})$ with center $C = (b_{21}^{-1}, b_{12}^{-1})$ and asymptotes $(B)_{12} = b_{21}^{-1}$ and $(B)_{21} = b_{12}^{-1}$ and is plotted in Fig. 15.1.¹⁰ The hyperbola consists of two disconnected branches with a pole at the center $C = (b_{21}^{-1}, b_{12}^{-1})$. At this point, the relation between the two parameters changes sign. The figure also indicates the two possible zero restrictions $(B)_{12} = 0$ and $(B)_{21} = 0$, called short-run restrictions. These two restrictions are connected by a path along the hyperbola which falls completely within one quadrant. Thus, along this path the signs of the parameters remain unchanged.
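The geometry of Eq. (15.7) can be explored numerically. The Python sketch below picks a value for $(B)_{12}$, solves Eq. (15.7) for $(B)_{21}$, backs out $\Omega = B^{-1}\Sigma B'^{-1}$, and rebuilds $B\Omega B'$ to confirm that every such point reproduces the same $\Sigma$. The covariance matrix and the function names `b21_on_hyperbola` and `implied_sigma` are hypothetical, and the sketch assumes $\sigma_{12} \neq 0$ and $(B)_{12} \neq b_{21}^{-1}$.

```python
def b21_on_hyperbola(x, sigma):
    """Solve Eq. (15.7) for (B)_21 given x = (B)_12 and sigma = (s11, s12, s22)."""
    s11, s12, s22 = sigma
    b21 = s12 / s11                      # regression coefficient of Z2 on Z1
    b12 = s12 / s22                      # regression coefficient of Z1 on Z2
    rhs = s11 * s22 / s12 ** 2 - 1.0     # r_12^{-2} - 1
    return 1.0 / b12 + rhs / (x - 1.0 / b21)

def implied_sigma(x, y, sigma):
    """Diagonal of Omega = B^{-1} Sigma B'^{-1} for B = [[1, x], [y, 1]],
    and the covariance matrix B Omega B' rebuilt from it."""
    s11, s12, s22 = sigma
    d2 = (1.0 - x * y) ** 2
    w1 = (s11 - 2.0 * x * s12 + x * x * s22) / d2
    w2 = (y * y * s11 - 2.0 * y * s12 + s22) / d2
    rebuilt = (w1 + x * x * w2, y * w1 + x * w2, y * y * w1 + w2)
    return rebuilt, (w1, w2)

sigma = (1.0, 0.5, 2.0)   # hypothetical (sigma_1^2, sigma_12, sigma_2^2) with sigma_12 > 0
points = []
for x in (0.0, 1.0, -1.0, 3.0):
    y = b21_on_hyperbola(x, sigma)
    rebuilt, omegas = implied_sigma(x, y, sigma)
    points.append((x, y, rebuilt, omegas))
    print(x, y, omegas, rebuilt)
```

Setting $(B)_{12} = 0$ returns the regression coefficient $b_{21}$, as stated in the text, while the other points are equally compatible with the same covariance matrix.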

Suppose that instead of fixing a particular parameter, we only want to restrict its sign. Assuming that $(B)_{12} > 0$ implies that $(B)_{21}$ must lie in one of the two disconnected intervals $(-\infty, b_{21}]$ and $(b_{12}^{-1}, +\infty)$.¹¹ Although not very explicit, some economic consequences of this topological particularity are discussed in Fry and Pagan (2011). Alternatively, assuming $(B)_{12} < 0$ implies $(B)_{21} \in [b_{21}, b_{12}^{-1})$. Thus, $(B)_{21}$ is unambiguously positive. Sign restrictions for $(B)_{21}$ can be discussed in a similar manner. Section 15.6 discusses sign restrictions more explicitly.


¹⁰ Moon et al. (2013; section 2) provided an alternative geometric representation.

¹¹ That these two intervals are disconnected follows from the fact that $\Sigma$ is positive definite.

15.3  Identification  via  Short-Run  Restrictions


Short-run restrictions represent the most common identification scheme encountered in practice. They impose direct linear restrictions on the structural parameters $A$ and $B$ and restrict in this way the contemporaneous effect of the structural shocks on the variables of the system. The most common type of such restrictions are zero restrictions which set certain coefficients a priori to zero. These zero restrictions are either derived from an explicit economic theory or are based on some ad hoc arguments. As explained above, it is necessary to have at least $3n(n-1)/2$ restrictions at hand. If there are more restrictions, we have an overidentified system. This is, however, rarely the case in practice because the number of necessary restrictions grows at a rate proportional to $n^2$. The case of overidentification is, thus, not often encountered and as such is not treated.¹² The way to find appropriate restrictions in a relatively large system is documented in Sect. 15.4.5. If the number of restrictions equals $3n(n-1)/2$, we say that the system is exactly identified.

Given the necessary number of a priori restrictions on the coefficients $A$ and $B$, there are two ways to infer $A$, $B$, and $\Omega$. The first one views the relation (15.4) as a simultaneous equation system in $Z_t$ with error terms $V_t$ and estimates the coefficients by instrumental variables as in Blanchard and Watson (1986).¹³ The second way relies on the method of moments and solves the nonlinear equation system (15.5) as in Bernanke (1986). In the case of exact identification both methods are numerically equivalent.

In the example of Sect. 15.2.1, we had $n = 2$ so that three restrictions were necessary (six parameters, but only three equations). These three restrictions were obtained by setting $B = I_2$ which gives two restrictions (i.e. $b_{12} = b_{21} = 0$). The third restriction was to set the immediate reaction of money supply to a demand shock to zero, i.e. to set $a_2 = 0$. We then showed that these three restrictions are also sufficient to solve the nonlinear equation system (15.5).

Sims (1980b) proposed in his seminal article the VAR approach as an adequate alternative to the then popular structural simultaneous equation approach. In particular, he suggested a simple recursive identification scheme. This scheme takes $A = I_n$ so that the equation system (15.5) simplifies to:

$$\Sigma = B \Omega B'.$$

Next we assume that $B$ is a lower triangular matrix:

$$B = \begin{pmatrix} 1 & 0 & \dots & 0 \\ * & 1 & \dots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ * & * & \dots & 1 \end{pmatrix}$$

¹² The case of overidentification is, for example, treated in Bernanke (1986).

¹³ A version of the instrumental variable (IV) approach is discussed in Sect. 15.5.2.

where $*$ is just a placeholder. The matrices $B$ and $\Omega$ are uniquely determined by the Cholesky decomposition of the matrix $\Sigma$. The Cholesky decomposition factorizes a positive-definite matrix $\Sigma$ uniquely into the product $B \Omega B'$ where $B$ is a lower triangular matrix with ones on the diagonal and $\Omega$ is a diagonal matrix with strictly positive diagonal entries (Meyer 2000). As $Z_t = B V_t$, Sims' identification gives rise to the following interpretation: $v_{1t}$ is the only structural shock which has an effect on $X_{1t}$ in period $t$. All other shocks have no contemporaneous effect. Moreover, $Z_{1t} = v_{1t}$ so that the residual from the first equation is just equal to the first structural shock and $\sigma_1^2 = \omega_1^2$. The second variable $X_{2t}$ is contemporaneously only affected by $v_{1t}$ and $v_{2t}$, and not by the remaining shocks $v_{3t}, \dots, v_{nt}$. In particular, $Z_{2t} = b_{21} v_{1t} + v_{2t}$ so that $b_{21}$ can be retrieved from the equation $\sigma_{21} = b_{21} \omega_1^2$. This identifies the second structural shock $v_{2t}$ and $\omega_2^2$. Due to the triangular nature of $B$, the system is recursive and all structural shocks and parameters can be identified successively. The application of the Cholesky decomposition as an identification scheme rests crucially on the ordering of the variables $(X_{1t}, X_{2t}, \dots, X_{nt})'$ in the system.
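The factorization $\Sigma = B \Omega B'$ can be obtained from a standard Cholesky routine by rescaling its columns. The following is a minimal sketch, assuming NumPy; the function name `ldl_from_cholesky` is chosen here for illustration, and the numerical input is the rounded covariance estimate from the example in Sect. 15.4.4.

```python
import numpy as np

def ldl_from_cholesky(sigma):
    """Factor a positive-definite sigma as B @ Omega @ B.T,
    with B lower triangular with ones on the diagonal and Omega diagonal."""
    L = np.linalg.cholesky(sigma)   # sigma = L @ L.T, L lower triangular
    d = np.diag(L)                  # strictly positive diagonal of L
    B = L / d                       # divide column j by d[j]: unit diagonal
    Omega = np.diag(d ** 2)         # variances of the structural shocks
    return B, Omega

# Rounded covariance estimate of the advertisement/sales example (Sect. 15.4.4)
sigma_hat = np.array([[0.038, 0.011],
                      [0.011, 0.010]])
B, Omega = ldl_from_cholesky(sigma_hat)
print(B)       # B[1, 0] equals sigma_21 / omega_1^2
print(Omega)
```

Reordering the variables (permuting rows and columns of `sigma_hat`) generally changes `B` and `Omega`, which is the dependence on the ordering noted above.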

Sims' scheme, although easy to implement, becomes less plausible as the number of variables in the system increases. For this reason the more general scheme with $A \neq I_n$ and $B$ not necessarily lower triangular is more popular. However, even for medium sized systems such as $n = 5$, 30 restrictions are necessary, which stretches the imagination even of brilliant economists, as the estimation of Blanchard's model in Sect. 15.4.5 shows.

Focusing on the identification of the matrices $A$ and $B$ also brings an advantage in terms of estimation. As shown in Chap. 13, the OLS-estimator of the VAR coefficient matrices $\Phi_1, \dots, \Phi_p$ equals the GLS-estimator independently of $\Sigma$. Thus, the estimation of the structural parameters can be broken down into two steps. In the first step, the coefficient matrices $\Phi_1, \dots, \Phi_p$ are estimated using OLS. The residuals are then used to compute an estimate $\hat\Sigma$ of the covariance matrix $\Sigma$ (see Eq. (13.6)). In the second step, the coefficients of $A$, $B$, and $\Omega$ are estimated by solving the nonlinear equation system (15.5), with $\Sigma$ replaced by its estimate $\hat\Sigma$ and taking the specific identification scheme into account. As $\sqrt{T}\left(\operatorname{vech}(\hat\Sigma) - \operatorname{vech}(\Sigma)\right)$ converges in distribution to a normal distribution with mean zero, $\hat A$, $\hat B$ and $\hat\Omega$ are also asymptotically normal because they are obtained by a one-to-one mapping from $\hat\Sigma$.$^{14}$ The Continuous Mapping Theorem further implies that $\hat A$, $\hat B$ and $\hat\Omega$ converge to their true values and that their asymptotic covariance matrix can be obtained by an application of the delta-method (see Theorem E.1 in Appendix E). Further details can be found in Bernanke (1986), Blanchard and Watson (1986), Giannini (1991), Hamilton (1994b), and Sims (1986).


14 The vech operator transforms a symmetric $n \times n$ matrix $\Sigma$ into a $\frac{1}{2}n(n+1)$ vector by stacking the columns of $\Sigma$ such that each element is listed only once.
15.4 Interpretation of VAR Models
15.4.1 Impulse Response Functions

The direct interpretation of VAR models is rather difficult because they are composed of many coefficients so that it becomes difficult to understand the dynamic interactions between the variables. It is therefore advantageous to simulate the dynamic effects of the different structural shocks by computing the impulse response functions. They show the effect over time of the structural shocks on the variables at issue. These effects can often be related to the underlying economic model and are thus at the heart of the VAR analysis. The examples in Sects. 15.4.4 and 15.4.5 provide some illustrations of this statement.

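The impulse response functions are derived from the causal representation$^{15}$ of the VAR process (see Sect. 12.3):

$$\begin{aligned} X_t &= Z_t + \Psi_1 Z_{t-1} + \Psi_2 Z_{t-2} + \dots \\ &= A^{-1}BV_t + \Psi_1 A^{-1}BV_{t-1} + \Psi_2 A^{-1}BV_{t-2} + \dots \qquad (15.8) \end{aligned}$$

The effect of the $j$-th structural disturbance on the $i$-th variable after $h$ periods, denoted by $\partial X_{i,t+h}/\partial v_{jt}$, is thus given by the $(i,j)$-th element of the matrix $\Psi_h A^{-1}B$:

$$\frac{\partial X_{i,t+h}}{\partial v_{jt}} = \left[\Psi_h A^{-1}B\right]_{ij}.$$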


Clearly, the impulse response functions depend on the identification scheme chosen. There are $n^2$ impulse response functions if the system consists of $n$ variables. Usually, the impulse response functions are represented graphically as a plot against $h$.
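The computation can be sketched as follows. The sketch assumes the standard recursion $\Psi_0 = I_n$, $\Psi_h = \sum_{j=1}^{\min(h,p)} \Phi_j \Psi_{h-j}$ for the coefficients of the causal representation; the function names and the numerical values of the VAR(1) are hypothetical and serve only as an illustration.

```python
import numpy as np

def ma_coefficients(phis, horizon):
    """Return [Psi_0, ..., Psi_horizon] for a VAR with phis = [Phi_1, ..., Phi_p]."""
    n = phis[0].shape[0]
    psis = [np.eye(n)]                       # Psi_0 = I_n
    for h in range(1, horizon + 1):
        psi_h = np.zeros((n, n))
        for j, phi in enumerate(phis, start=1):
            if j <= h:                       # only lags that already exist
                psi_h += phi @ psis[h - j]
        psis.append(psi_h)
    return psis

def impulse_responses(phis, A, B, horizon):
    """irf[h, i, j] = [Psi_h A^{-1} B]_{ij}: response of variable i to shock j after h periods."""
    impact = np.linalg.solve(A, B)           # A^{-1} B without forming the inverse
    return np.array([psi @ impact for psi in ma_coefficients(phis, horizon)])

# Hypothetical bivariate VAR(1) with a recursive identification (A = I_2)
phi1 = np.array([[0.5, 0.1],
                 [0.2, 0.3]])
A = np.eye(2)
B = np.array([[1.0, 0.0],
              [0.4, 1.0]])
irf = impulse_responses([phi1], A, B, horizon=10)
print(irf[:3])
```

The array `irf[:, i, j]`, plotted against the horizon, gives one of the $n^2$ impulse response functions.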


15.4.2 Forecast Error Variance Decomposition

Another instrument for the interpretation of VAR models is the forecast error variance decomposition ("FEVD") or variance decomposition for short which decomposes the total forecast error variance of a variable into the variances of the structural shocks. It is again based on the causal representation of the VAR(p) model. According to Eq. (14.3) in Chap. 14 the variance of the forecast error or mean squared error (MSE) is given by:

$$\begin{aligned} \mathrm{MSE}(h) &= \mathbb{E}\,(X_{t+h} - P_t X_{t+h})(X_{t+h} - P_t X_{t+h})' \\ &= \sum_{j=0}^{h-1} \Psi_j \Sigma \Psi_j' = \sum_{j=0}^{h-1} \Psi_j A^{-1} B \Omega B' A'^{-1} \Psi_j'. \end{aligned}$$

15 Sometimes the causal representation is called the MA($\infty$) representation.


Given a specific identification scheme and estimates of the structural parameters, it is possible to attribute the MSE to the variance of the structural disturbances. Thereby it is customary to write the contribution of each disturbance as a percentage of the total variance. For this purpose let us write the MSE($h$) explicitly as

$$\mathrm{MSE}(h) = \begin{pmatrix} m_{11}^{(h)} & * & \dots & * \\ * & m_{22}^{(h)} & \dots & * \\ \vdots & \vdots & \ddots & \vdots \\ * & * & \dots & m_{nn}^{(h)} \end{pmatrix}.$$


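Our interest lies exclusively on the variances $m_{ii}^{(h)}$, $i = 1, \dots, n$, so that we represent the uninteresting covariance terms by the placeholder $*$. These variances can be seen as a linear combination of the $\omega_j^2$'s because the covariance matrix of the structural disturbances $\Omega = \operatorname{diag}(\omega_1^2, \dots, \omega_n^2)$ is a diagonal matrix:

$$m_{ii}^{(h)} = d_{i1}^{(h)} \omega_1^2 + \dots + d_{in}^{(h)} \omega_n^2$$

or in matrix form

$$m_{ii}^{(h)} = e_i' \left( \sum_{j=0}^{h-1} \Psi_j A^{-1} B \Omega B' A'^{-1} \Psi_j' \right) e_i$$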


where the vector $e_i$ has entries equal to zero, except for the $i$-th entry which is equal to one. The weights $d_{ij}^{(h)}$, $i,j = 1, \dots, n$ and $h = 1, 2, \dots$, are nonnegative and, given the positive definiteness of $\Sigma$, for each $i$ at least one of them is strictly positive. They can be computed as

$$d_{ij}^{(h)} = \sum_{k=0}^{h-1} \left( \left[ \Psi_k A^{-1} B \right]_{ij} \right)^2.$$

In order to arrive at the percentage value of the contribution of the $j$-th disturbance to the MSE of the $i$-th variable at forecast horizon $h$, denoted by $f_{ij}^{(h)}$, we divide each summand in the above expression by the total sum:

$$f_{ij}^{(h)} = \frac{d_{ij}^{(h)} \omega_j^2}{m_{ii}^{(h)}}, \qquad i,j = 1, \dots, n, \quad \text{for } h = 1, 2, \dots$$
The corresponding matrix expression is

$$f_{ij}^{(h)} = \frac{e_i' \left( \sum_{k=0}^{h-1} \Psi_k A^{-1} B \Omega^{1/2} e_j e_j' \Omega^{1/2} B' A'^{-1} \Psi_k' \right) e_i}{e_i' \left( \sum_{k=0}^{h-1} \Psi_k \Sigma \Psi_k' \right) e_i}.$$

Usually, these numbers are multiplied by 100 to give percentages and are either displayed graphically as a plot against $h$ or in table form (see the example in Sect. 15.4.5). The forecast error variance share $f_{ij}^{(h)}$ thus shows which percentage of the forecast variance of variable $i$ at horizon $h$ can be attributed to the $j$-th structural shock and thus measures the contribution of each of these shocks to the overall fluctuations of the variables in question.
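The decomposition translates almost literally into code. The sketch below assumes NumPy and takes the matrices $\Psi_0, \Psi_1, \dots$ as given; the function name `fevd` and all numerical values are hypothetical and only illustrate the formulas for $d_{ij}^{(h)}$ and $f_{ij}^{(h)}$.

```python
import numpy as np

def fevd(psis, A, B, omega_diag, h):
    """Shares f[i, j]: fraction of the h-step forecast error variance of
    variable i that is due to structural shock j (each row sums to one)."""
    impact = np.linalg.solve(A, B)                      # A^{-1} B
    # weights d_ij^(h) = sum_{k=0}^{h-1} ([Psi_k A^{-1} B]_ij)^2
    d = sum((psi @ impact) ** 2 for psi in psis[:h])
    contrib = d * omega_diag                            # d_ij^(h) * omega_j^2
    total = contrib.sum(axis=1, keepdims=True)          # m_ii^(h)
    return contrib / total

# Hypothetical VAR(1): Psi_k is the k-th power of Phi_1
phi1 = np.array([[0.5, 0.1],
                 [0.2, 0.3]])
psis = [np.linalg.matrix_power(phi1, k) for k in range(12)]
A = np.eye(2)
B = np.array([[1.0, 0.0],
              [0.4, 1.0]])
omega_diag = np.array([1.0, 0.5])       # diagonal of Omega

shares_1 = fevd(psis, A, B, omega_diag, h=1)
shares_8 = fevd(psis, A, B, omega_diag, h=8)
print(100 * shares_1)
print(100 * shares_8)
```

With a recursive scheme the second shock has a share of exactly zero in the first variable at $h = 1$, which is the kind of zero entry that the identifying restrictions impose at short horizons.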

The FEVD can be used as an alternative identification scheme, sometimes called max share identification. Assume for ease of exposition that $A = I_n$. The VAR disturbances and the structural shocks are then simply related as $Z_t = BV_t$ (compare Eq. (15.4)). Moreover, take $\Omega = I_n$, but leave $B$ unrestricted. This corresponds to a different, but equivalent normalization which economizes on the notation. Then the $j$-th structural disturbance can be identified by assuming that it maximizes the forecast error variance share with respect to variable $i$. Noting that, given $\Sigma$, $B$ can be written as $B = RQ$ with $R$ being the unique Cholesky factor of $\Sigma$ and $Q$ being an orthogonal matrix, this optimization problem can be cast as

$$\max_{q_j} \; e_i' \left( \sum_{k=0}^{h-1} \Psi_k R \, q_j q_j' R' \Psi_k' \right) e_i \qquad \text{s.t. } q_j' q_j = 1$$

where $q_j$ is the $j$-th column of $Q$, i.e. $q_j = Qe_j$. The constraint $q_j' q_j = 1$ normalizes the vector to have length 1. From $Z_t = BV_t$ it then follows that the corresponding structural disturbance is equal to $v_{jt} = q_j' R^{-1} Z_t$. Because $e_j' Q' R^{-1} \Sigma R'^{-1} Q e_k = 0$ for $j \neq k$, this shock is orthogonal to the other structural disturbances. For practical applications it is advisable for reasons of numerical stability to transform the optimization problem into an equivalent eigenvalue problem (see Faust 1998; appendix).
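One way to see the eigenvalue formulation: the objective equals $q_j' S q_j$ with $S = \sum_{k=0}^{h-1} R' \Psi_k' e_i e_i' \Psi_k R$, so that the maximizer under $q_j' q_j = 1$ is an eigenvector of the symmetric matrix $S$ belonging to its largest eigenvalue. The sketch below implements this observation with NumPy; it is not taken from Faust's appendix, and the function name and numerical values are hypothetical.

```python
import numpy as np

def max_share_shock(psis, sigma, i, h):
    """Return (q, share): the unit vector q maximizing the contribution of one
    shock to the h-step forecast error variance of variable i, and that share."""
    R = np.linalg.cholesky(sigma)            # sigma = R R'
    S = np.zeros(sigma.shape)
    for psi in psis[:h]:
        row = psi[i, :] @ R                  # e_i' Psi_k R as a vector
        S += np.outer(row, row)              # accumulate R' Psi_k' e_i e_i' Psi_k R
    eigvals, eigvecs = np.linalg.eigh(S)     # eigenvalues in ascending order
    q = eigvecs[:, -1]                       # eigenvector of the largest eigenvalue
    share = eigvals[-1] / np.trace(S)        # trace(S) is the total variance m_ii^(h)
    return q, share

# Hypothetical VAR(1) and covariance matrix
phi1 = np.array([[0.5, 0.1],
                 [0.2, 0.3]])
psis = [np.linalg.matrix_power(phi1, k) for k in range(12)]
sigma = np.array([[1.0, 0.3],
                  [0.3, 0.5]])
q, share = max_share_shock(psis, sigma, i=0, h=8)
impact_column = np.linalg.cholesky(sigma) @ q   # corresponding column of B = RQ
print(q, share, impact_column)
```

The sign of `q` is not determined by the optimization and has to be fixed by an additional convention, for example a positive impact response of variable $i$.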


15.4.3  Confidence  Intervals 

The  impulse  response  functions  and  the  variance  decomposition  are  the  most 
important  tools  for  the  analysis  and  interpretation  of  VAR  models.  It  is,  therefore, 
of  importance  not  only  to  estimate  these  entities,  but  also  to  provide  corresponding 
confidence  intervals  to  underpin  the  interpretation  from  a  statistical  perspective.  In 
the  literature  two  approaches  have  been  established:  an  analytic  and  a  bootstrap 
approach. The analytic approach relies on the fact that the coefficient matrices $\Psi_h$, $h = 1, 2, \dots$, are continuously differentiable functions of the estimated VAR coefficients: $\operatorname{vec}(\Psi_h) = F_h(\beta)$ where $\beta$ denotes as in Chap. 13 the vectorized form of the VAR coefficient matrices $\Phi_1, \dots, \Phi_p$.$^{16}$ The relation between VAR coefficients and the causal representation (MA($\infty$) representation) was established in Sect. 12.3. This discussion shows that the functions $F_h : \mathbb{R}^{pn^2} \to \mathbb{R}^{n^2}$ are highly nonlinear. In Chap. 13 it was shown that $\sqrt{T}\,(\hat\beta - \beta) \xrightarrow{d} N\left(0, \Sigma \otimes \Gamma_p^{-1}\right)$ so that we can apply the Delta method (see Theorem E.1 or Serfling (1980; 122-124)) to get

$$\sqrt{T}\left(F_h(\hat\beta) - F_h(\beta)\right) \xrightarrow{d} N\left(0, \left(\frac{\partial F_h(\beta)}{\partial \beta'}\right) \left(\Sigma \otimes \Gamma_p^{-1}\right) \left(\frac{\partial F_h(\beta)}{\partial \beta'}\right)'\right)$$

where in practice the covariance matrix is estimated by replacing $\beta$ and $\Sigma$ by their estimates. The computation of the gradient matrices $\partial F_h(\beta)/\partial \beta'$ is rather involved, especially when $h$ becomes large. Details can be found in Lütkepohl (1990, 2006) and Mittnik and Zadrozny (1993).

The use of this asymptotic approximation has two problems. First, the complexity of the relationship between the $\Phi_i$'s and the $\Psi_h$'s increases with $h$ so that the quality of the approximation diminishes with $h$ for any given sample size. This is true even when $\hat\beta$ is exactly normally distributed. Second, the distribution of $\hat\beta$ is approximated poorly by the normal distribution. This is particularly relevant when the roots of $\Phi(L)$ are near the unit circle. In this case, the bias towards zero can become substantial (see the discussion in Sect. 7.2). These two problems become especially relevant as $h$ increases. For these reasons the analytic approach has become less popular.

The bootstrap approach (Monte Carlo or simulation approach), as advocated by Runkle (1987), Kilian (1998) and Sims (1999), has become the most favored approach. This is partly due to the development of powerful computer algorithms and the increased speed of computations. The so-called naive bootstrap approach consists of several steps.$^{17}$

First step: Using a random number generator, new disturbances are created. This can be done in two ways. The first one assumes a particular distribution for $V_t$: $V_t \sim N(0, \hat\Omega)$, for example. The realizations are then independent draws from this distribution. The second one takes random draws with replacement from the identified realizations $\hat V_1, \dots, \hat V_T$.$^{18}$ The second way has the advantage that no explicit distributional assumption is made, which results in a better approximation of the true distribution of $V_t$.


16 Recall that the vec operator stacks the columns of an $n \times m$ matrix to get one $nm$ vector.

17 The bootstrap is a resampling method. Efron and Tibshirani (1993) provide a general introduction to the bootstrap.

18 The draws can also be done blockwise. This has the advantage that possible remaining temporal dependences are taken into account.
Second step: Given the fixed starting values $X_{-p+1}, \dots, X_0$, the estimated coefficient matrices $\hat\Phi_1, \dots, \hat\Phi_p$ and the new disturbances drawn in step one, a new realization of the time series for $\{X_t\}$ is generated.

Third step: Estimate the VAR model, given the newly generated realizations for $\{X_t\}$, to obtain new estimates for the coefficient matrices.

Fourth step: Generate a new set of impulse response functions given the new estimates, taking the identification scheme as fixed.

Steps one to four are repeated many times to generate a whole family of impulse response functions which form the basis for the computation of the confidence bands. In many applications, these confidence bands are constructed in a naive fashion by connecting the confidence intervals for individual impulse responses at different horizons. This, however, ignores the fact that the impulse responses at different horizons are correlated, which implies that the true coverage probability of the confidence band is different from the presumed one. Thus, the joint probability distribution of the impulse responses should serve as the basis of the computation of the confidence bands. Recently, several alternatives have been proposed which take this feature into account. Lütkepohl et al. (2013) provide a comparison of several methods.

The number of repetitions should be at least 500. The method can be refined somewhat if the bias towards zero of the estimates of the $\Phi$'s is taken into account. This bias can again be determined through simulation methods (Kilian 1998). A critical appraisal of the bootstrap can be found in Sims (1999) where additional improvements are discussed. The bootstrap of the variance decomposition works in a similar way.
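The four steps can be sketched in a few lines. The sketch below makes several simplifying assumptions that are not part of the text: the VAR contains a constant as the only deterministic term, identification is recursive ($A = I_n$, $B$ lower triangular with unit diagonal), whole residual vectors $\hat Z_t$ are resampled (which for a fixed $\hat B$ is equivalent to resampling $\hat V_t = \hat B^{-1} \hat Z_t$), and the bands are the naive pointwise percentile bands. All function names and the simulated data are hypothetical.

```python
import numpy as np

def fit_var(X, p):
    """OLS estimate of a VAR(p) with constant; X has shape (T, n)."""
    T, n = X.shape
    W = np.hstack([np.ones((T - p, 1))] + [X[p - j:T - j] for j in range(1, p + 1)])
    coef, *_ = np.linalg.lstsq(W, X[p:], rcond=None)
    resid = X[p:] - W @ coef
    phis = [coef[1 + (j - 1) * n:1 + j * n].T for j in range(1, p + 1)]
    return coef[0], phis, resid

def simulate_var(c, phis, start, shocks):
    """Step 2: generate a path from fixed starting values and given disturbances."""
    path = list(start)
    for z in shocks:
        path.append(c + sum(phi @ path[-j] for j, phi in enumerate(phis, start=1)) + z)
    return np.array(path)

def recursive_irf(phis, resid, horizon):
    """Step 4: impulse responses Psi_h B with B from the Cholesky decomposition."""
    n = resid.shape[1]
    L = np.linalg.cholesky(resid.T @ resid / resid.shape[0])
    B = L / np.diag(L)                       # unit diagonal, lower triangular
    psis = [np.eye(n)]
    for h in range(1, horizon + 1):
        psis.append(sum(phi @ psis[h - j]
                        for j, phi in enumerate(phis, start=1) if j <= h))
    return np.array([psi @ B for psi in psis])

def bootstrap_irf(X, p, horizon, reps=500, seed=0):
    rng = np.random.default_rng(seed)
    c, phis, resid = fit_var(X, p)
    draws = []
    for _ in range(reps):
        idx = rng.integers(0, len(resid), size=len(resid))   # step 1: draw with replacement
        X_star = simulate_var(c, phis, X[:p], resid[idx])     # step 2
        _, phis_star, resid_star = fit_var(X_star, p)         # step 3
        draws.append(recursive_irf(phis_star, resid_star, horizon))  # step 4
    lower, upper = np.percentile(np.array(draws), [2.5, 97.5], axis=0)
    return recursive_irf(phis, resid, horizon), lower, upper

# Hypothetical data generated from a bivariate VAR(1)
rng = np.random.default_rng(1)
phi_true = np.array([[0.5, 0.1], [0.2, 0.3]])
X = np.zeros((200, 2))
for t in range(1, 200):
    X[t] = phi_true @ X[t - 1] + rng.standard_normal(2)

point, lower, upper = bootstrap_irf(X, p=1, horizon=8, reps=200)
print(point[1], lower[1], upper[1])
```

The arrays `lower` and `upper` connect pointwise 95-% intervals across horizons and therefore share the coverage problem described above; a bias correction in the spirit of Kilian (1998) is not included.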


15.4.4 Example 1: Advertisement and Sales

In  this  example  we  will  analyze  the  dynamic  relationship  between  advertisement 
expenditures  and  sales  by  a  VAR  approach.  The  data  we  will  use  are  the  famous  data 
from  the  Lydia  E.  Pinkham  Medicine  Company  which  cover  yearly  observations 
from  1907  to  1960.  These  data  were  among  the  first  ones  which  have  been  used 
to  quantify  the  effect  of  advertisement  expenditures  on  sales.  The  data  are  taken 
from  Berndt  (1991;  Chapter  8)  where  details  on  the  specificities  of  the  data  and  a 
summary  of  the  literature  can  be  found. 

We denote the two-dimensional logged time series of advertisement expenditures and sales by $\{X_t\} = \{(\ln(\text{advertisement}_t), \ln(\text{sales}_t))'\}$. We consider VAR models of order one to six. For each VAR($p$), $p = 1, 2, \dots, 6$, we compute the corresponding information criteria AIC and BIC (see Sect. 14.3). Whereas AIC favors a model of order two, BIC proposes the more parsimonious model of order one. To be on the safe side, we work with a VAR model of order two whose estimates are reported below:
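$$X_t = \begin{pmatrix} \underset{(0.634)}{0.145} \\ \underset{(0.333)}{0.762} \end{pmatrix} + \begin{pmatrix} \underset{(0.174)}{0.451} & \underset{(0.302)}{0.642} \\ \underset{(0.091)}{-0.068} & \underset{(0.159)}{1.245} \end{pmatrix} X_{t-1} + \begin{pmatrix} \underset{(0.180)}{-0.189} & \underset{(0.333)}{0.009} \\ \underset{(0.095)}{-0.176} & \underset{(0.175)}{-0.125} \end{pmatrix} X_{t-2} + Z_t$$

where the estimated standard deviations of the coefficients are reported in parentheses. The estimate $\hat\Sigma$ of the covariance matrix $\Sigma$ is

$$\hat\Sigma = \begin{pmatrix} 0.038 & 0.011 \\ 0.011 & 0.010 \end{pmatrix}.$$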


The estimated VAR(2) model is taken to be the reduced form model. The structural model contains two structural shocks: a shock to advertisement expenditures, $V_{At}$, and a shock to sales, $V_{St}$. The disturbance vector of the structural shocks is thus $\{V_t\} = \{(V_{At}, V_{St})'\}$. It is related to $Z_t$ via relation (15.4), i.e. $AZ_t = BV_t$. To identify the model we thus need 3 restrictions.$^{19}$ We will first assume that $A = I_2$ which gives two restrictions. A plausible further assumption is that shocks to sales have no contemporaneous effects on advertisement expenditures. This zero restriction seems justified because advertisement campaigns have to be planned in advance. They cannot be produced and carried out immediately. This argument then delivers the third restriction as it implies that $B$ is lower triangular. This lower triangular matrix can be obtained from the Cholesky decomposition of $\hat\Sigma$:

$$\hat B = \begin{pmatrix} 1 & 0 \\ \hat b_{21} & 1 \end{pmatrix} \quad \text{with } \hat b_{21} = \hat\sigma_{21}/\hat\omega_1^2 = 0.011/0.038 \approx 0.29$$

and

$$\hat\Omega = \begin{pmatrix} 0.038 & 0 \\ 0 & 0.007 \end{pmatrix}.$$

The  identifying  assumptions  then  imply  the  impulse  response  functions  plotted 
in  Fig.  15.2. 

The upper left figure shows the response of advertisement expenditures to a sudden transitory increase in advertisement expenditures by 1 % (i.e. to a 1-% increase of $V_{At}$). This shock is positively propagated to the future years, but is statistically zero after four years. After four years the response even turns negative, but remains statistically insignificant. The same shock produces an increase of sales by 0.3 % in the current and next year as shown in the lower left figure. The effect then deteriorates and becomes even negative after three years. The right hand figures display the reaction

19 Formula (15.6) for $n = 2$ gives 3 restrictions.
[Figure 15.2: four panels of impulse responses plotted against the period (0 to 30); surviving panel titles: "effect of an advertisement shock", "effect of an advertisement shock on sales", "effect of a sales shock on advertisement expenditures"]

Fig. 15.2 Impulse response functions for advertisement expenditures and sales with 95-% confidence intervals computed using the bootstrap procedure


to a sudden transitory increase of sales by 1 %. Again, we see that the shock is positively propagated. Thereby the largest effect is reached after two years and then declines monotonically. The reaction of advertisement expenditures is initially equal to zero by construction as it corresponds to the identifying assumption with regard to $B$. Then, the effect starts to increase, reaches a maximum after three years, and declines monotonically thereafter. After 15 years the effect is practically zero. The 95-% confidence intervals are rather large so that all effects are no longer statistically significant after a few years.
15.4.5 Example 2: IS-LM Model with Phillips Curve

In this example we replicate the study of Blanchard (1989) which investigates the US business cycle within a traditional IS-LM model with Phillips curve.$^{20}$ The starting point of his analysis is the VAR($p$) model:

$$X_t = \Phi_1 X_{t-1} + \dots + \Phi_p X_{t-p} + C D_t + Z_t$$

where $\{X_t\}$ is a five-dimensional time series $X_t = (Y_t, U_t, P_t, W_t, M_t)'$. The individual elements of $X_t$ denote the following variables:

$Y_t$ ... growth rate of real GDP
$U_t$ ... unemployment rate
$P_t$ ... inflation rate
$W_t$ ... growth rate of wages
$M_t$ ... growth rate of money stock.

The VAR has attached to it a disturbance term $Z_t = (Z_{yt}, Z_{ut}, Z_{pt}, Z_{wt}, Z_{mt})'$. Finally, $\{D_t\}$ denotes the deterministic variables of the model such as a constant, time trend or dummy variables. In the following, we assume that all variables are stationary.

The business cycle is seen as the result of five structural shocks which impinge on the economy:

$V_{dt}$ ... aggregate demand shock
$V_{st}$ ... aggregate supply shock
$V_{pt}$ ... price shock
$V_{wt}$ ... wage shock
$V_{mt}$ ... money shock.

We will use the IS-LM model to rationalize the restrictions so that we will be able to identify the structural form from the estimated VAR model. The disturbances of the structural and the reduced form models are related by the simultaneous equation system:

$$AZ_t = BV_t$$

where $V_t = (V_{dt}, V_{st}, V_{pt}, V_{wt}, V_{mt})'$ and where $A$ and $B$ are $5 \times 5$ matrices with ones on the diagonal. Blanchard (1989) proposes the following specification:


20 The results do not match exactly those of Blanchard (1989), but are qualitatively similar.
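$$\begin{aligned} \text{(AD):} \quad Z_{yt} &= V_{dt} + b_{12} V_{st} \\ \text{(OL):} \quad Z_{ut} &= -a_{21} Z_{yt} + V_{st} \\ \text{(PS):} \quad Z_{pt} &= -a_{34} Z_{wt} - a_{31} Z_{yt} + b_{32} V_{st} + V_{pt} \\ \text{(WS):} \quad Z_{wt} &= -a_{43} Z_{pt} - a_{42} Z_{ut} + b_{42} V_{st} + V_{wt} \\ \text{(MR):} \quad Z_{mt} &= -a_{51} Z_{yt} - a_{52} Z_{ut} - a_{53} Z_{pt} - a_{54} Z_{wt} + V_{mt}. \end{aligned}$$

In matrix notation the above simultaneous equation system becomes:

$$\begin{pmatrix} 1 & 0 & 0 & 0 & 0 \\ a_{21} & 1 & 0 & 0 & 0 \\ a_{31} & 0 & 1 & a_{34} & 0 \\ 0 & a_{42} & a_{43} & 1 & 0 \\ a_{51} & a_{52} & a_{53} & a_{54} & 1 \end{pmatrix} \begin{pmatrix} Z_{yt} \\ Z_{ut} \\ Z_{pt} \\ Z_{wt} \\ Z_{mt} \end{pmatrix} = \begin{pmatrix} 1 & b_{12} & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 & 0 \\ 0 & b_{32} & 1 & 0 & 0 \\ 0 & b_{42} & 0 & 1 & 0 \\ 0 & 0 & 0 & 0 & 1 \end{pmatrix} \begin{pmatrix} V_{dt} \\ V_{st} \\ V_{pt} \\ V_{wt} \\ V_{mt} \end{pmatrix}$$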

The first equation is interpreted as an aggregate demand (AD) equation where the disturbance term related to GDP growth, $Z_{yt}$, depends on the demand shock $V_{dt}$ and the supply shock $V_{st}$. The second equation is related to Okun's law (OL) which relates the unemployment disturbance $Z_{ut}$ to the demand disturbance and the supply shock. Thereby an increase in GDP growth reduces unemployment in the same period by $a_{21}$ whereas a supply shock increases it. The third and the fourth equation represent a price (PS) and wage setting (WS) system where wages and prices interact simultaneously. Finally, the fifth equation (MR) is supposed to determine the money shock. No distinction is made between money supply and money demand shocks. A detailed interpretation of these equations is found in the original article by Blanchard (1989).

Given that the dimension of the system is five (i.e. $n = 5$), formula (15.6) instructs us that we need $3 \times (5 \times 4)/2 = 30$ restrictions. Counting the number of zero restrictions implemented above, we see that we only have 28 zeros. Thus we lack two additional restrictions. We can reach the same conclusion by counting the number of coefficients and the number of equations. The coefficients are $a_{21}, a_{31}, a_{34}, a_{42}, a_{43}, a_{51}, a_{52}, a_{53}, a_{54}, b_{12}, b_{32}, b_{42}$ and the diagonal elements of $\Omega$, the covariance matrix of $V_t$. We therefore have to determine 17 unknown coefficients out of $(5 \times 6)/2 = 15$ equations. Thus we find again that we are short of two restrictions. Blanchard discusses several possibilities among which the restrictions $b_{12} = 1.0$ and $a_{34} = 0.1$ seem most plausible.
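The counting argument can be checked mechanically. In the sketch below, which assumes NumPy, the 0/1 arrays encode which off-diagonal coefficients of $A$ and $B$ are left free in the specification above; the names `required_restrictions`, `A_free` and `B_free` are introduced here for illustration only.

```python
import numpy as np

def required_restrictions(n):
    """Restrictions needed for exact identification when A and B have unit diagonals."""
    unknowns = 2 * n * (n - 1) + n      # off-diagonal entries of A and B plus diag(Omega)
    equations = n * (n + 1) // 2        # distinct elements of the symmetric Sigma
    return unknowns - equations         # equals 3 n (n - 1) / 2

# 1 marks a free off-diagonal coefficient, 0 a zero restriction
# (diagonal entries are normalized to one and therefore not counted)
A_free = np.array([[0, 0, 0, 0, 0],
                   [1, 0, 0, 0, 0],
                   [1, 0, 0, 1, 0],
                   [0, 1, 1, 0, 0],
                   [1, 1, 1, 1, 0]])
B_free = np.array([[0, 1, 0, 0, 0],
                   [0, 0, 0, 0, 0],
                   [0, 1, 0, 0, 0],
                   [0, 1, 0, 0, 0],
                   [0, 0, 0, 0, 0]])

n = 5
free = int(A_free.sum() + B_free.sum())          # free coefficients in A and B
zeros = 2 * n * (n - 1) - free                   # zero restrictions imposed
missing = required_restrictions(n) - zeros       # restrictions still lacking
print(required_restrictions(n), free, zeros, missing)
```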

The sample period runs from the second quarter of 1959 to the second quarter of 2004, encompassing 181 observations. Following Blanchard, we include a constant in combination with a linear time trend in the model. Whereas BIC suggests a model of order one, AIC favors a model of order two. As a model of order one seems rather restrictive, we stick to the VAR(2) model whose estimated coefficients are reported below$^{21}$:


21  To  save  space,  the  estimated  standard  errors  of  the  coefficients  are  not  reported. 
[Display of the estimated coefficient matrices; only the following fragments survive]

 0.07  -1.31   0.01
-0.46   0.06  -0.02
 0.13
 0.79  -0.05   0.76
 0.29  -0.06   0.13
The first column of $\hat C$ relates to the constants, whereas the second column gives the coefficients of the time trend. From these estimates and given the identifying restrictions established above, the equation $\hat\Sigma = A^{-1} B \Omega B' A'^{-1}$ uniquely determines the matrices $A$, $B$ and $\Omega$:


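$$\hat A = \begin{pmatrix} 1 & 0 & 0 & 0 & 0 \\ 0.050 & 1 & 0 & 0 & 0 \\ 0.038 & 0 & 1 & 0.1 & 0 \\ 0 & 1.77 & -0.24 & 1 & 0 \\ 0.033 & 1.10 & 0.01 & -0.13 & 1 \end{pmatrix}$$

$$\hat B = \begin{pmatrix} 1 & 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 & 0 \\ 0 & -1.01 & 1 & 0 & 0 \\ 0 & 1.55 & 0 & 1 & 0 \\ 0 & 0 & 0 & 0 & 1 \end{pmatrix}$$

$$\hat\Omega = \begin{pmatrix} 9.838 & 0 & 0 & 0 & 0 \\ 0 & 0.037 & 0 & 0 & 0 \\ 0 & 0 & 0.899 & 0 & 0 \\ 0 & 0 & 0 & 5.162 & 0 \\ 0 & 0 & 0 & 0 & 10.849 \end{pmatrix}$$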
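As a plausibility check, the one-quarter-ahead variance decomposition depends only on $A^{-1}B$ and $\Omega$ because $\Psi_0 = I_5$. The following sketch, assuming NumPy, computes these shares from the rounded matrices reported above; the variable names are chosen here for illustration. The resulting percentages are close to the horizon-one rows of Table 15.1, with the remaining differences presumably due to the rounding of the reported coefficients.

```python
import numpy as np

A = np.array([[1.0,   0.0,   0.0,   0.0,  0.0],
              [0.050, 1.0,   0.0,   0.0,  0.0],
              [0.038, 0.0,   1.0,   0.1,  0.0],
              [0.0,   1.77, -0.24,  1.0,  0.0],
              [0.033, 1.10,  0.01, -0.13, 1.0]])
B = np.array([[1.0,  1.0,  0.0, 0.0, 0.0],
              [0.0,  1.0,  0.0, 0.0, 0.0],
              [0.0, -1.01, 1.0, 0.0, 0.0],
              [0.0,  1.55, 0.0, 1.0, 0.0],
              [0.0,  0.0,  0.0, 0.0, 1.0]])
omega = np.array([9.838, 0.037, 0.899, 5.162, 10.849])   # diagonal of Omega

impact = np.linalg.solve(A, B)              # A^{-1} B: impact effects of the shocks
contrib = impact ** 2 * omega               # d_ij^(1) * omega_j^2
shares = 100 * contrib / contrib.sum(axis=1, keepdims=True)

# rows: Y, U, P, W, M; columns: demand, supply, price, wage, money shock
print(np.round(shares, 2))
```

The exact zeros in the last column for the first four variables reflect the restriction that the money shock enters only the fifth equation contemporaneously.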


In order to facilitate the interpretation of the results, we have plotted the impulse response functions and their 95-% confidence bands in Fig. 15.3. The results show that a positive demand shock has a positive and statistically significant effect on GDP growth only in the first three quarters; after that the effect becomes even slightly negative and vanishes after sixteen quarters. The positive demand shock reduces unemployment significantly for almost fifteen quarters. The maximal effect is achieved after three to four quarters. Although the initial effect is negative, the positive demand shock also drives inflation up which then pushes up wage growth. The supply shock also has a positive effect on GDP growth, but it takes more than four quarters before the effect reaches its peak. In the short-run the positive supply shock even reduces GDP growth. In contrast to the demand shock, the positive supply shock increases unemployment in the short-run. Only in the medium- to long-run does it reduce unemployment. The effect on price and wage inflation is negative.

Fig. 15.3 Impulse response functions for the IS-LM model with Phillips curve with 95-% confidence intervals computed using the bootstrap procedure (compare with Blanchard (1989))

Finally, we compute the forecast error variance decomposition according to Sect. 15.4.2. The results are reported in Table 15.1. In the short-run, the identifying restrictions play an important role as reflected by the plain zeros. The demand shock accounts for almost all the variance of GDP growth in the short-run. Its share of 99.62 % at a forecast horizon of one quarter diminishes as $h$ increases to 40 quarters, but at 86.13 % it still remains very high. The supply shock, on the contrary, does not explain much of the variation in GDP growth. Even for a horizon of 40 quarters, it contributes only 6.11 %. The supply shock is, however, important for the variation in the unemployment rate, especially in the short-run. At a horizon of one quarter it explains more than 50 % whereas demand shocks account for only 42.22 %. Its contribution diminishes with the increase of the forecast horizon giving room for price and wage shocks. The variance of the inflation rate is explained in the short-run almost exclusively by price shocks. However, as the forecast horizon is increased supply and wage shocks become relatively important. The money growth rate does not interact much with the other variables. Its variation is almost exclusively explained by money shocks.

Table 15.1 Forecast error variance decomposition (FEVD) in terms of demand, supply, price, wage, and money shocks (percentages)

Horizon    Demand    Supply    Price     Wage      Money

Growth rate of real GDP
1          99.62     0.38      0         0         0
2          98.13     0.94      0.02      0.87      0.04
4          93.85     1.59      2.13      1.86      0.57
8          88.27     4.83      3.36      2.43      0.61
40         86.13     6.11      4.29      2.58      0.89

Unemployment rate
1          42.22     57.78     0         0         0
2          52.03     47.57     0.04      0.01      0.00
4          64.74     33.17     1.80      0.13      0.16
8          66.05     21.32     10.01     1.99      0.63
40         39.09     16.81     31.92     10.73     0.89

Inflation rate
1          0.86      4.18      89.80     5.15      0
2          0.63      13.12     77.24     8.56      0.45
4          0.72      16.79     68.15     13.36     0.97
8          1.79      19.34     60.69     16.07     2.11
40         2.83      20.48     55.84     17.12     3.74

Growth rate of wages
1          1.18      0.10      0.97      97.75     0
2          1.40      0.10      4.30      93.50     0.69
4          2.18      2.75      9.78      84.49     0.80
8          3.80      6.74      13.40     74.72     1.33
40         5.11      8.44      14.19     70.14     2.13

Growth rate of money stock
1          0.10      0.43      0.00      0.84      98.63
2          1.45      0.44      0.02      1.02      97.06
4          4.22      1.09      0.04      1.90      92.75
8          8.31      1.55      0.81      2.65      86.68
40         8.47      2.64      5.77      4.55      78.57


15.5 Identification via Long-Run Restrictions
15.5.1 A Prototypical Example

Besides short-run restrictions, essentially zero restrictions on the coefficients of $A$ and/or $B$, Blanchard and Quah (1989) proposed long-run restrictions as an alternative option. These long-run restrictions have to be seen as complementary to the short-run ones as they can be combined. Long-run restrictions constrain the long-run effect of structural shocks. This technique makes sense only if some integrated variables are involved, because in the stationary case the effects of all shocks vanish eventually. To explain this, we discuss the two-dimensional example given by Blanchard and Quah (1989).

They analyze a two-variable system consisting of logged real GDP denoted by $\{Y_t\}$ and the unemployment rate $\{U_t\}$. Logged GDP is typically integrated of order one (see Sect. 7.3.4 for an analysis of Swiss GDP) whereas $U_t$ is considered to be stationary. Thus they apply the VAR approach to the stationary process $\{X_t\} = \{(\Delta Y_t, U_t)'\}$. Assuming that $\{X_t\}$ is already demeaned and follows a causal VAR process, we have the following representations:

$$X_t = \begin{pmatrix} \Delta Y_t \\ U_t \end{pmatrix} = \Phi_1 X_{t-1} + \dots + \Phi_p X_{t-p} + Z_t = \Psi(L) Z_t = Z_t + \Psi_1 Z_{t-1} + \Psi_2 Z_{t-2} + \dots$$


For  simplicity,  we  assume  that  A  =  I2,  so  that 


where  V,  =  ( Vdt ,  vst)'  ~  WN(0,  £2)  with  £2  =  diag (&>j,  co f).  Thereby  {v*}  and  {ttj,} 
denote  demand  and  supply  shocks,  respectively.  The  causal  representation  of  J  X,  J 
implies  that  the  effect  of  a  demand  shock  in  period  t  on  GDP  growth  in  period  t  +  h 
is  given  by: 
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where  1 1  denotes  the  upper  left  hand  element  of  the  matrix  'Yi,B.  K,+/,  can  be 
written  as  Yt+h  =  A Y,+it  +  +  . . .  +  AF,+  i  +  Y,  so  that  the  effect  of  the 

demand  shock  on  the  level  of  logged  GDP  is  given  by: 


Blanchard and Quah (1989) propose, in accordance with conventional economic
theory, to restrict the long-run effect of the demand shock on the level of logged
GDP to zero:

$$\lim_{h \to \infty} \frac{\partial Y_{t+h}}{\partial v_{dt}} = \sum_{j=0}^{\infty} [\Psi_j B]_{11} = 0.$$

This implies that

$$\sum_{j=0}^{\infty} \Psi_j B = \Psi(1) B = \begin{pmatrix} 0 & * \\ * & * \end{pmatrix}$$

where $*$ is a placeholder. This restriction is sufficient to infer $b_{21}$ from the relation
$[\Psi(1)]_{11} b_{11} + b_{21} [\Psi(1)]_{12} = 0$ and the normalization $b_{11} = 1$:

$$b_{21} = -\frac{[\Psi(1)]_{11}}{[\Psi(1)]_{12}} = -\frac{[\Phi(1)^{-1}]_{11}}{[\Phi(1)^{-1}]_{12}}.$$

The second part of the equation follows from the identity $\Phi(z) \Psi(z) = I_2$ which
gives $\Psi(1) = \Phi(1)^{-1}$ for $z = 1$. The long-run effect of the supply shock on $Y_t$ is
left unrestricted and is therefore in general nonzero. Note that the implied value of
$b_{21}$ depends on $\Phi(1)$, and thus on $\Phi_1, \dots, \Phi_p$. The results are therefore, in contrast
to short-run restrictions, much more sensitive to the correct specification of the VAR.

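The relation $Z_t = B V_t$ implies that

$$\Sigma = E Z_t Z_t' = B \Omega B'$$

or more explicitly

$$\Sigma = \begin{pmatrix} \sigma_1^2 & \sigma_{12} \\ \sigma_{12} & \sigma_2^2 \end{pmatrix} = \begin{pmatrix} 1 & b_{12} \\ b_{21} & 1 \end{pmatrix} \begin{pmatrix} \omega_d^2 & 0 \\ 0 & \omega_s^2 \end{pmatrix} \begin{pmatrix} 1 & b_{21} \\ b_{12} & 1 \end{pmatrix}.$$

Taking $b_{21}$ as already given from above, this equation system has three equations
and three unknowns $b_{12}, \omega_d^2, \omega_s^2$, which is a necessary condition for a solution to
exist.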

Analytic  Solution  of  the  System 

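Because $b_{21}$ is already determined from the long-run restriction, we rewrite the
equation system22 explicitly in terms of $b_{21}$:

$$\begin{aligned}
\sigma_1^2 &= \omega_d^2 + b_{12}^2 \omega_s^2 \\
\sigma_{12} &= b_{21} \omega_d^2 + b_{12} \omega_s^2 \\
\sigma_2^2 &= b_{21}^2 \omega_d^2 + \omega_s^2.
\end{aligned}$$

Using the last two equations, we can express $\omega_d^2$ and $\omega_s^2$ as functions of $b_{12}$:

$$\omega_d^2 = \frac{\sigma_{12} - b_{12} \sigma_2^2}{b_{21} - b_{12} b_{21}^2}, \qquad
\omega_s^2 = \frac{\sigma_2^2 - b_{21} \sigma_{12}}{1 - b_{12} b_{21}}.$$

These expressions are only valid if $b_{21} \neq 0$ and $b_{12} b_{21} \neq 1$. The case $b_{21} = 0$
is not of substantive interest. It would simplify the original equation
system considerably and would result in $b_{12} = \sigma_{12}/\sigma_2^2$, $\omega_d^2 = (\sigma_1^2 \sigma_2^2 - \sigma_{12}^2)/\sigma_2^2 > 0$
and $\omega_s^2 = \sigma_2^2 > 0$. The case $b_{12} b_{21} = 1$ contradicts the assumption that $\Sigma$ is a
positive-definite matrix and can therefore be disregarded.23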

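Inserting the solutions of $\omega_d^2$ and $\omega_s^2$ into the first equation, we obtain a quadratic
equation in $b_{12}$:

$$(b_{21} \sigma_2^2 - b_{21}^2 \sigma_{12})\, b_{12}^2 + (b_{21}^2 \sigma_1^2 - \sigma_2^2)\, b_{12} + (\sigma_{12} - b_{21} \sigma_1^2) = 0.$$

The discriminant $\Delta$ of this equation is:

$$\begin{aligned}
\Delta &= (b_{21}^2 \sigma_1^2 - \sigma_2^2)^2 - 4 (b_{21} \sigma_2^2 - b_{21}^2 \sigma_{12}) (\sigma_{12} - b_{21} \sigma_1^2) \\
&= (b_{21}^2 \sigma_1^2 + \sigma_2^2)^2 - 4 b_{21} \sigma_{12} (b_{21}^2 \sigma_1^2 + \sigma_2^2) + 4 b_{21}^2 \sigma_{12}^2 \\
&= (b_{21}^2 \sigma_1^2 + \sigma_2^2 - 2 b_{21} \sigma_{12})^2 > 0.
\end{aligned}$$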


22 This equation system is similar to the one analyzed in Sect. 15.2.3.

23 If $b_{12} b_{21} = 1$, $\det \Sigma = \sigma_1^2 \sigma_2^2 - \sigma_{12}^2 = 0$. This implies that $Z_{1t}$ and $Z_{2t}$ are perfectly correlated,
i.e. $\rho^2_{Z_{1t}, Z_{2t}} = \sigma_{12}^2 / (\sigma_1^2 \sigma_2^2) = 1$.

The positivity of the discriminant implies that the quadratic equation has two real
solutions $b_{12}^{(1)}$ and $b_{12}^{(2)}$:

$$b_{12}^{(1)} = \frac{\sigma_2^2 - b_{21} \sigma_{12}}{b_{21} \sigma_2^2 - b_{21}^2 \sigma_{12}} = \frac{1}{b_{21}}, \qquad
b_{12}^{(2)} = \frac{\sigma_{12} - b_{21} \sigma_1^2}{\sigma_2^2 - b_{21} \sigma_{12}}.$$

The first solution $b_{12}^{(1)}$ can be excluded because it violates the condition $b_{12} b_{21} \neq 1$;
the equality $b_{12} b_{21} = 1$ would contradict the positive-definiteness of the covariance matrix $\Sigma$.
Inserting the second solution back into the expressions for $\omega_d^2$ and $\omega_s^2$, we finally get:

$$\omega_d^2 = \frac{\sigma_1^2 \sigma_2^2 - \sigma_{12}^2}{b_{21}^2 \sigma_1^2 - 2 b_{21} \sigma_{12} + \sigma_2^2} > 0, \qquad
\omega_s^2 = \frac{(\sigma_2^2 - b_{21} \sigma_{12})^2}{b_{21}^2 \sigma_1^2 - 2 b_{21} \sigma_{12} + \sigma_2^2} > 0.$$

Because $\Sigma$ is a symmetric positive-definite matrix, $\sigma_1^2 \sigma_2^2 - \sigma_{12}^2$ and the denominator
$b_{21}^2 \sigma_1^2 - 2 b_{21} \sigma_{12} + \sigma_2^2$ are strictly positive. Thus, both variances are
positive and we have found the unique admissible solution.
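
The closed-form solution can be checked numerically. The following is a minimal sketch, not part of the original derivation: the function names and the input values are made up for illustration, and it assumes $\sigma_2^2 - b_{21} \sigma_{12} \neq 0$. It computes $b_{12}^{(2)}$, $\omega_d^2$ and $\omega_s^2$ from $\Sigma$ and a given $b_{21}$ and then rebuilds $B \Omega B'$ to confirm that the three equations hold.

```python
def solve_long_run_system(s11, s12, s22, b21):
    """Solve Sigma = B Omega B' for (b12, omega_d^2, omega_s^2), taking b21 as given.

    s11, s12, s22 are the entries of the reduced-form covariance matrix Sigma.
    """
    if b21 == 0.0:
        # Degenerate case b21 = 0 discussed in the text.
        return s12 / s22, (s11 * s22 - s12 ** 2) / s22, s22
    denom = b21 ** 2 * s11 - 2.0 * b21 * s12 + s22      # strictly positive
    b12 = (s12 - b21 * s11) / (s22 - b21 * s12)         # admissible root of the quadratic
    omega_d2 = (s11 * s22 - s12 ** 2) / denom
    omega_s2 = (s22 - b21 * s12) ** 2 / denom
    return b12, omega_d2, omega_s2


def implied_sigma(b12, b21, omega_d2, omega_s2):
    """Entries (s11, s12, s22) of B Omega B' with B = [[1, b12], [b21, 1]]."""
    s11 = omega_d2 + b12 ** 2 * omega_s2
    s12 = b21 * omega_d2 + b12 * omega_s2
    s22 = b21 ** 2 * omega_d2 + omega_s2
    return s11, s12, s22


# Illustrative (made-up) inputs: Sigma = [[2.0, 0.5], [0.5, 1.0]] and b21 = -0.3.
b12, omega_d2, omega_s2 = solve_long_run_system(2.0, 0.5, 1.0, -0.3)
rebuilt = implied_sigma(b12, -0.3, omega_d2, omega_s2)
```

Here `rebuilt` should agree with the entries of $\Sigma$ up to floating-point error, and both variances should be positive.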


15.5.2  The  General  Approach 

The general case of long-run restrictions has a structure similar to the case of
short-run restrictions. Take as a starting point the structural VAR (15.2) from Sect. 15.2.2:

$$A X_t = \Gamma_1 X_{t-1} + \dots + \Gamma_p X_{t-p} + B V_t, \qquad V_t \sim WN(0, \Omega),$$

$$A(L) X_t = B V_t$$

where $\{X_t\}$ is stationary and causal with respect to $\{V_t\}$. As before, the matrix $A$
is normalized to have ones on its diagonal and is assumed to be invertible, $\Omega$ is a
diagonal matrix with $\Omega = \operatorname{diag}(\omega_1^2, \dots, \omega_n^2)$, and $B$ is a matrix with ones on the
diagonal. The matrix polynomial $A(L)$ is defined as $A(L) = A - \Gamma_1 L - \dots - \Gamma_p L^p$.
The reduced form is given by

$$\Phi(L) X_t = Z_t, \qquad Z_t \sim WN(0, \Sigma)$$


where $A Z_t = B V_t$ and $A \Phi_j = \Gamma_j$, $j = 1, \dots, p$, respectively $A \Phi(L) = A(L)$.

The long-run variance of $\{X_t\}$, $J$ (see Eq. (11.1) in Chap. 11), can be derived
from the reduced as well as from the structural form, which gives the following
expressions:

$$\begin{aligned}
J &= \Phi(1)^{-1} \Sigma\, \Phi(1)^{-1\prime} = \Phi(1)^{-1} A^{-1} B \Omega B' A^{-1\prime} \Phi(1)^{-1\prime} = A(1)^{-1} B \Omega B' A(1)^{-1\prime} \\
&= \Psi(1) \Sigma\, \Psi(1)' = \Psi(1) A^{-1} B \Omega B' A^{-1\prime} \Psi(1)'
\end{aligned}$$

where $X_t = \Psi(L) Z_t$ denotes the causal representation of $\{X_t\}$. The long-run variance
$J$ can be estimated by adapting the methods in Sect. 4.4 to the multivariate case.
Thus, the above equation system has a structure similar to that of the system (15.5) which
underlies the case of short-run restrictions. As before, we get $n(n+1)/2$ equations
with $2n^2 - n$ unknowns. The nonlinear equation system is therefore underdetermined
for $n \geq 2$. Therefore, $3n(n-1)/2$ additional equations or restrictions are necessary
to achieve identification. Hence, conceptually we are in a similar situation as in the
case of short-run restrictions.24
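
As a quick check of this counting argument, consider the bivariate case. The worked arithmetic below is only an illustration of the formulas just stated for $n = 2$:

```latex
\[
\underbrace{\tfrac{n(n+1)}{2}}_{\text{equations}} = 3, \qquad
\underbrace{2n^2 - n}_{\text{unknowns}} = 6, \qquad
\underbrace{\tfrac{3n(n-1)}{2}}_{\text{restrictions needed}} = 6 - 3 = 3
\qquad (n = 2).
\]
```

In the prototypical example of Sect. 15.5.1, setting $A = I_2$ supplies two of these restrictions and the zero long-run effect of the demand shock supplies the third.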

In practice, it is customary to achieve identification through zero restrictions
where some elements of $\Psi(1) A^{-1} B$, respectively $\Phi(1)^{-1} A^{-1} B$, are set a priori
to zero. Setting the $ij$-th element $[\Psi(1) A^{-1} B]_{ij} = [\Phi(1)^{-1} A^{-1} B]_{ij}$ equal to zero
amounts to setting the cumulative effect of the $j$-th structural disturbance $V_{jt}$ on the
$i$-th variable equal to zero. If the $i$-th variable enters $X_t$ in first differences, as was
the case for $Y_t$ in the previous example, this zero restriction restrains the long-run
effect on the level of that variable.

An interesting simplification arises if one assumes that $A = I_n$ and that $\Psi(1) B =
\Phi(1)^{-1} B$ is a lower triangular matrix. In this case, $B$ and $\Omega$ can be estimated from
the Cholesky decomposition of the estimated long-run variance $\hat{J}$. Let $\hat{J} = L D L'$
be the Cholesky decomposition where $L$ is a lower triangular matrix with ones on
the diagonal and $D$ is a diagonal matrix with strictly positive diagonal entries.
As $\hat{J} = \hat{\Phi}(1)^{-1} \hat{B} \hat{\Omega} \hat{B}' \hat{\Phi}(1)^{-1\prime}$, the matrix of structural coefficients can then be
estimated as $\hat{B} = \hat{\Phi}(1) L U^{-1}$. The multiplication by the inverse of the diagonal
matrix $U = \operatorname{diag}(\hat{\Phi}(1) L)$ is necessary to guarantee that the normalization of $\hat{B}$
(diagonal elements equal to one) is respected. $\Omega$ is then estimated as $\hat{\Omega} = U D U$.
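
The recipe just described can be sketched in a few lines. This is an illustrative sketch only: the helper names and the numerical inputs are made up, and it assumes that the diagonal of $\hat{\Phi}(1) L$ contains no zeros. The final lines rebuild $\hat{B} \hat{\Omega} \hat{B}'$ and compare it with $\hat{\Phi}(1) \hat{J} \hat{\Phi}(1)'$, which is the same identity as $\hat{J} = \hat{\Phi}(1)^{-1} \hat{B} \hat{\Omega} \hat{B}' \hat{\Phi}(1)^{-1\prime}$.

```python
def matmul(A, B):
    """Product of two matrices stored as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


def ldl(J):
    """Return (L, D) with J = L diag(D) L', where L is unit lower triangular."""
    n = len(J)
    L = [[float(i == j) for j in range(n)] for i in range(n)]
    D = [0.0] * n
    for j in range(n):
        D[j] = J[j][j] - sum(L[j][k] ** 2 * D[k] for k in range(j))
        for i in range(j + 1, n):
            L[i][j] = (J[i][j] - sum(L[i][k] * L[j][k] * D[k] for k in range(j))) / D[j]
    return L, D


def long_run_identification(phi1, J):
    """B = Phi(1) L U^{-1} and the diagonal of Omega = U D U, with U = diag(Phi(1) L)."""
    L, D = ldl(J)
    PL = matmul(phi1, L)
    n = len(J)
    U = [PL[i][i] for i in range(n)]                    # assumed to be nonzero
    B = [[PL[i][j] / U[j] for j in range(n)] for i in range(n)]
    omega = [U[i] ** 2 * D[i] for i in range(n)]
    return B, omega


# Illustrative (made-up) values for Phi(1) and the long-run variance J.
phi1 = [[0.9, -0.3], [0.1, 0.5]]
J = [[4.0, 1.0], [1.0, 2.0]]
B, omega = long_run_identification(phi1, J)

n = len(J)
B_omega_Bt = [[sum(B[i][k] * omega[k] * B[j][k] for k in range(n)) for j in range(n)]
              for i in range(n)]
phi_J_phit = matmul(matmul(phi1, J), [list(col) for col in zip(*phi1)])
```

By construction, `B` has ones on its diagonal and $\hat{\Phi}(1)^{-1} \hat{B} = L U^{-1}$ is lower triangular.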

Instead of using a method of moments approach, it is possible to use an
instrumental variable (IV) approach. For this purpose we rewrite the reduced form
of $X_t$ in the Dickey-Fuller form (see Eqs. (7.1) and (16.4)):

$$\Delta X_t = -\Phi(1) X_{t-1} + \tilde{\Phi}_1 \Delta X_{t-1} + \dots + \tilde{\Phi}_{p-1} \Delta X_{t-p+1} + Z_t,$$

where $\tilde{\Phi}_j = -\sum_{i=j+1}^{p} \Phi_i$, $j = 1, 2, \dots, p-1$. For ease of exposition, we assume
$B = I_n$ so that $A Z_t = V_t$. Multiplying this equation by $A$ yields:

$$A \Delta X_t = -A \Phi(1) X_{t-1} + A \tilde{\Phi}_1 \Delta X_{t-1} + \dots + A \tilde{\Phi}_{p-1} \Delta X_{t-p+1} + V_t. \qquad (15.9)$$


24 See Rubio-Ramírez et al. (2010) for a unified treatment of both types of restrictions.

Consider for simplicity the case that $A \Phi(1)$ is a lower triangular matrix. This implies
that the structural shocks $V_{2t}, V_{3t}, \dots, V_{nt}$ have no long-run impact on the first
variable $X_{1t}$. It is, therefore, possible to estimate the coefficients $A_{12}, A_{13}, \dots, A_{1n}$
by instrumental variables taking $X_{2,t-1}, X_{3,t-1}, \dots, X_{n,t-1}$ as instruments.

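For $n = 2$ the Dickey-Fuller form of the equation system (15.9) is:

$$\begin{pmatrix} 1 & A_{12} \\ A_{21} & 1 \end{pmatrix}
\begin{pmatrix} \Delta \tilde{X}_{1t} \\ \Delta \tilde{X}_{2t} \end{pmatrix}
= - \begin{pmatrix} [A \Phi(1)]_{11} & 0 \\ [A \Phi(1)]_{21} & [A \Phi(1)]_{22} \end{pmatrix}
\begin{pmatrix} X_{1,t-1} \\ X_{2,t-1} \end{pmatrix}
+ \begin{pmatrix} V_{1t} \\ V_{2t} \end{pmatrix},$$

respectively

$$\begin{aligned}
\Delta \tilde{X}_{1t} &= -A_{12} \Delta \tilde{X}_{2t} - [A \Phi(1)]_{11} X_{1,t-1} + V_{1t} \\
\Delta \tilde{X}_{2t} &= -A_{21} \Delta \tilde{X}_{1t} - [A \Phi(1)]_{21} X_{1,t-1} - [A \Phi(1)]_{22} X_{2,t-1} + V_{2t}.
\end{aligned}$$

Thereby $\Delta \tilde{X}_{1t}$ and $\Delta \tilde{X}_{2t}$ denote the OLS residuals from a regression of $\Delta X_{1t}$,
respectively $\Delta X_{2t}$, on $(\Delta X_{1,t-1}, \Delta X_{2,t-1}, \dots, \Delta X_{1,t-p+1}, \Delta X_{2,t-p+1})$. $X_{2,t-1}$ is a valid
instrument for $\Delta \tilde{X}_{2t}$ because this variable does not appear in the first equation. Thus,
$A_{12}$ can be consistently estimated by the IV-approach. For the estimation of $A_{21}$,
we can use the residuals from the first equation as instruments because $V_{1t}$ and
$V_{2t}$ are assumed to be uncorrelated. From this example, it is easy to see how this
recursive method can be generalized to more than two variables. Note that the
IV-approach can also be used in the context of short-run restrictions.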

The issue of whether a technology shock leads to a reduction of hours worked in
the short-run led to a lively discussion on the usefulness of long-run restrictions for
structural models (Galí 1999; Christiano et al. 2003, 2006; Chari et al. 2008). From
an econometric point of view, it turned out, on the one hand, that the estimation
of $\Phi(1)$ is critical for the method of moments approach. The IV-approach, on the
other hand, depends on the strength or weakness of the instrument used (Pagan and
Robertson 1998; Gospodinov 2010).

It is, of course, possible to combine both short- and long-run restrictions
simultaneously. An interesting application of both techniques was presented by Galí
(1992). In doing so, one must make sure that both types of restrictions are consistent
with each other and keep in mind that counting the number of restrictions gives only a necessary
condition.

Example  3:  Identifying  Aggregate  Demand  and  Supply  Shocks 

In this example, we follow Blanchard and Quah (1989) and investigate the behavior
of the growth rate of real GDP and the unemployment rate for the US over the
period from the first quarter 1979 to the second quarter 2004 (102 observations). The
AIC and the BIC suggest models of order two and one, respectively. Because some
coefficients of $\hat{\Phi}_2$ are significant at the 10 % level, we prefer to use the VAR(2)
model which results in the following estimates25:

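$$\hat{\Phi}_1 = \begin{pmatrix} 0.070 & -3.376 \\ -0.026 & 1.284 \end{pmatrix}, \qquad
\hat{\Phi}_2 = \begin{pmatrix} 0.029 & 3.697 \\ -0.022 & -0.320 \end{pmatrix}, \qquad
\hat{\Sigma} = \begin{pmatrix} 7.074 & -0.382 \\ -0.382 & 0.053 \end{pmatrix}.$$

These results can be used to estimate $\Phi(1) = I_2 - \Phi_1 - \Phi_2$ and consequently also
$\Psi(1) = \Phi(1)^{-1}$:

$$\hat{\Phi}(1) = \begin{pmatrix} 0.901 & -0.321 \\ 0.048 & 0.036 \end{pmatrix}, \qquad
\hat{\Psi}(1) = \hat{\Phi}(1)^{-1} = \begin{pmatrix} 0.755 & 6.718 \\ -1.003 & 18.832 \end{pmatrix}.$$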


Assuming that $Z_t = B V_t$ and following the argument in Sect. 15.5.1 that the demand
shock has no long-run impact on the level of real GDP, we can retrieve an estimate
for $b_{21}$:

$$\hat{b}_{21} = -[\hat{\Psi}(1)]_{11} / [\hat{\Psi}(1)]_{12} = -0.112.$$

The solutions of the quadratic equation for $b_{12}$ are $-8.894$ and $43.285$. As the first
solution results in a negative variance for $\omega_d^2$, we can disregard this solution and stick
to the second one. The second solution also makes sense economically, because a
positive supply shock leads to positive effects on GDP. Setting $b_{12} = 43.285$ gives
the following estimates for the covariance matrix of the structural shocks $\Omega$:

$$\hat{\Omega} = \begin{pmatrix} \hat{\omega}_d^2 & 0 \\ 0 & \hat{\omega}_s^2 \end{pmatrix}
= \begin{pmatrix} 4.023 & 0 \\ 0 & 0.0016 \end{pmatrix}.$$

The big difference in the variances of the two shocks clearly shows the greater
importance of demand shocks for business cycle movements.
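
The arithmetic of this example can be retraced with a short script. This is a sketch for checking purposes only: the helper `inverse_2x2` and the variable names are illustrative, and because the inputs are the rounded figures reported above, the results agree with the text only up to rounding. The second part plugs the reported $b_{12}$, $b_{21}$ and $\hat{\Omega}$ into $B \Omega B'$ to see that the reported covariance entries $7.074$, $-0.382$ and $0.053$ are approximately reproduced.

```python
def inverse_2x2(M):
    """Inverse of a 2 x 2 matrix given as a list of rows."""
    (a, b), (c, d) = M
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]


# Estimated Phi(1) as reported in the text (rounded to three decimals).
phi1_hat = [[0.901, -0.321], [0.048, 0.036]]
psi1_hat = inverse_2x2(phi1_hat)                 # Psi(1) = Phi(1)^{-1}
b21_hat = -psi1_hat[0][0] / psi1_hat[0][1]       # long-run restriction on the demand shock

# Consistency check with the reported structural estimates.
b12, b21 = 43.285, -0.112
omega_d2, omega_s2 = 4.023, 0.0016
s11 = omega_d2 + b12 ** 2 * omega_s2             # compare with 7.074
s12 = b21 * omega_d2 + b12 * omega_s2            # compare with -0.382
s22 = b21 ** 2 * omega_d2 + omega_s2             # compare with 0.053
```

Small discrepancies are to be expected here, since the denominator $\sigma_2^2 - b_{21} \sigma_{12}$ is close to zero and therefore sensitive to rounding.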

Figure 15.4 shows the impulse response functions of the VAR(2) identified by
the long-run restriction. Each panel displays the dynamic effect of a demand or a
supply shock on real GDP or the unemployment rate, respectively, where the size
of the initial shock corresponds to one standard deviation. The result conforms well
with standard economic reasoning. A positive demand shock increases real GDP
and lowers the unemployment rate in the short-run. The effect is even amplified
for some quarters before it declines monotonically. After 30 quarters the effect of
the demand shock has practically vanished so that its long-run effect becomes zero as
imposed by the restriction. The supply shock has a similar short-run effect on real
GDP, but initially increases the unemployment rate. Only when the effect on GDP
becomes stronger after some quarters will the unemployment rate start to decline.
In the long-run, the supply shock has a positive effect on real GDP but no effect
on unemployment. Interestingly, only the short-run effects of the demand shock are
statistically significant at the 95-% level.

[Figure 15.4, four panels: effect of demand shock to GDP; effect of supply shock to GDP;
effect of demand shock to the unemployment rate; effect of supply shock to the
unemployment rate. Horizontal axis: period (0 to 40).]

Fig. 15.4  Impulse response functions of the Blanchard-Quah model (Blanchard and Quah 1989)
with 95-% confidence intervals computed using the bootstrap procedure

25 The results for the constants are suppressed to save space.


15.6  Sign  Restrictions 

In recent years the use of sign restrictions has attracted a lot of attention. Pioneering
contributions have been provided by Faust (1998), Canova and De Nicolo (2002),
and Uhlig (2005). Since then, applications have abounded in
many contexts. Sign restrictions try to identify the impact of the structural shocks by
requiring that the signs of the impulse response coefficients follow a given
pattern. The motivation behind this development is that economists are often more
confident about the sign of an economic relationship than about its exact magnitude.
This seems to be true also for zero restrictions, whether they are short- or long-run
restrictions. This insight already led Samuelson (1947) to advocate a calculus of
qualitative relations in economics. Unfortunately, this approach has been forgotten
in the progression of economic analysis.26 With the rise in popularity of sign
restrictions his ideas may see a revival.

To make the notion of sign restrictions precise, we introduce a language based
on the following notation and definitions. Sign restrictions will be specified as sign
pattern matrices. These matrices make use of the sign function of a real number $x$,
$\operatorname{sgn}(x)$, defined as

$$\operatorname{sgn}(x) = \begin{cases} 1, & \text{if } x > 0; \\ -1, & \text{if } x < 0; \\ 0, & \text{if } x = 0. \end{cases}$$

Definition 15.1. A sign pattern matrix, or pattern for short, is a matrix whose
elements are solely from the set $\{1, -1, 0\}$. Given a sign pattern matrix $S$, the sign
pattern class of $S$ is defined by

$$\mathcal{S}(S) = \{M \in \mathcal{M}(n) \mid \operatorname{sgn}(M) = S\}$$

where $\mathcal{M}(n)$ is the set of all $n \times n$ matrices and where sgn is applied elementwise
to $M$. Clearly, $\mathcal{S}(S) = \mathcal{S}(\operatorname{sgn}(S))$.

Remark 15.1. Often the set $\{+, -, 0\}$ is used instead of $\{+1, -1, 0\}$ to denote the
sign patterns.

Remark 15.2. In some instances we do not want to restrict all signs but only a subset
of them. In this case, the elements of the sign pattern matrix $S$ may come from the larger
set $\{-1, 0, 1, \#\}$ where $\#$ stands for an unspecified sign. $S$ is then called a generalized
sign pattern matrix or a generalized pattern. The addition $+$ and the multiplication
$\times$ of (generalized) sign pattern matrices are defined in a natural way.
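
Membership in a (generalized) sign pattern class is easy to test mechanically. The following minimal sketch uses illustrative function names and represents the unspecified sign by the string `"#"`; both choices are assumptions made for the example.

```python
def sgn(x):
    """Sign function: 1 for x > 0, -1 for x < 0 and 0 for x = 0."""
    return (x > 0) - (x < 0)


def in_sign_pattern_class(M, S):
    """True if sgn(M) agrees with S elementwise; entries of S equal to '#' are unrestricted."""
    return all(
        s == "#" or sgn(m) == s
        for row_m, row_s in zip(M, S)
        for m, s in zip(row_m, row_s)
    )


# Generalized pattern: the sign of the upper right element is left unspecified.
S = [[1, "#"], [-1, 1]]
allowed = in_sign_pattern_class([[0.5, -2.0], [-0.1, 3.0]], S)
rejected = in_sign_pattern_class([[0.5, 1.0], [0.2, 3.0]], S)
```

The first matrix matches the pattern, whereas the second violates the required negative sign in the lower left position.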

In order not to overload the discussion, we set $A = I_n$ and focus on the case where
$Z_t = B V_t$.27 Moreover, we use a different, but completely equivalent normalization.
In particular, we relax, on the one hand, the assumption that $B$ has only ones on its
diagonal, but assume on the other hand that $\Omega = I_n$. Note that $\Sigma = E Z_t Z_t' = B B'$ is
a strictly positive definite matrix. Assuming that a causal representation of $\{X_t\}$ in
terms of $\{Z_t\}$ and thus also in terms of $\{V_t\}$ exists, we can represent $X_t$ as

$$X_t = B V_t + \Psi_1 B V_{t-1} + \Psi_2 B V_{t-2} + \dots = \sum_{h=0}^{\infty} \Psi_h B V_{t-h} = \Psi(L) B V_t.$$

Denote by $\mathcal{B}(\Sigma)$ the set of invertible matrices $B$ such that $\Sigma = B B'$, i.e. $\mathcal{B}(\Sigma) =
\{B \in GL(n) \mid B B' = \Sigma\}$.28 Thus, $\mathcal{B}(\Sigma)$ is the set of all feasible structural
factorizations (models).

26 It is interesting to note that Samuelson's ideas have fallen on fertile ground in areas like
computer science or combinatorics (see Brualdi and Shader 1995; Hall and Li 2014).

27 The case with general $A$ matrices can be treated analogously.

Sign restrictions on the impulse responses can then be defined in terms
of a sequence of (generalized) sign pattern matrices $\{S_h\}$, $h = 0, 1, 2, \dots$

Definition 15.2 (Sign Restrictions). A causal VAR allows a sequence of (generalized)
sign pattern matrices $\{S_h\}$ if and only if there exists $B \in \mathcal{B}(\Sigma)$ such that

$$\Psi_h B \in \mathcal{S}(S_h), \quad \text{for all } h = 0, 1, 2, \dots \qquad (15.10)$$

Remark 15.3. As $\|\Psi_j B\|$ converges to zero for $j \to \infty$, it seems reasonable to
impose sign restrictions only up to some horizon $h_{\max} < \infty$. In this case, $S_h$,
$h > h_{\max}$, is equal to the generalized sign pattern matrix whose elements consist
exclusively of #'s. A case of particular interest is given by $h_{\max} = 0$. In this case, we
drop the index 0 and say that the VAR allows the (generalized) sign pattern matrix $S$.

Remark 15.4. With this notation we can also represent (short-run) zero restrictions
if the sign patterns are restricted to 0 and #.

A natural question to ask is how restrictive a prescribed sign pattern is. This
amounts to the question whether a given VAR can be compatible with any sign
pattern. As is already clear from the discussion of the two-dimensional case in
Sect. 15.2.3, this is not the case. As the set of feasible parameters can be represented
by a rectangular hyperbola, there will always be one quadrant with no intersection
with the branches of the hyperbola. In the example plotted in Fig. 15.1, this is
quadrant III. Thus, configurations with $(B)_{21} < 0$ and $(B)_{12} < 0$ are not feasible
given $(\Sigma)_{12} > 0$. This argument can be easily extended to models of higher
dimensions. Thus, there always exist sign patterns which are incompatible with a
given $\Sigma$.

As pointed out in Sect. 15.3, there always exists a unique lower triangular matrix
$R$, called the Cholesky factor of $\Sigma$, such that $\Sigma = R R'$. Thus, $\mathcal{B}(\Sigma) \neq \emptyset$ because
$R \in \mathcal{B}(\Sigma)$.

Lemma 15.1. Let the Cholesky factor of $\Sigma$ be $R$, then

$$\mathcal{B}(\Sigma) = \{B \in GL(n) \mid \exists\, Q \in O(n) : B = R Q\}.$$

Proof. Suppose $B = R Q$ with $Q \in O(n)$, then $B B' = R Q Q' R' = \Sigma$. Thus,
$B \in \mathcal{B}(\Sigma)$. If $B \in \mathcal{B}(\Sigma)$, define $Q = R^{-1} B$. Then, $Q Q' = R^{-1} B B' (R')^{-1} =
R^{-1} \Sigma (R')^{-1} = R^{-1} R R' (R')^{-1} = I_n$. Thus, $Q \in O(n)$. □

This lemma establishes that there is a one-to-one function $\varphi_\Sigma$ from the group
of orthogonal matrices $O(n)$ onto the set of feasible structural factorizations $\mathcal{B}(\Sigma)$.
From the proof we see that $\varphi_\Sigma(Q) = R Q$ and $\varphi_\Sigma^{-1}(B) = R^{-1} B$. Moreover, for
any two matrices $B_1$ and $B_2$ in $\mathcal{B}(\Sigma)$ with $B_1 = R Q_1$ and $B_2 = R Q_2$, there exists
an orthogonal matrix $Q$ equal to $Q_2' Q_1$ such that $B_1 = B_2 Q$. As $\varphi_\Sigma$ and $\varphi_\Sigma^{-1}$ are
clearly continuous, $\varphi_\Sigma$ is a homeomorphism. See Neusser (2016) for more details
and further implications.

28 $GL(n)$ is known as the general linear group. It is the set of all invertible $n \times n$ matrices.
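
For $n = 2$ the content of Lemma 15.1 can be illustrated numerically. The sketch below uses a made-up covariance matrix and hypothetical helper names, and it covers only rotations, which form one half of $O(2)$. For several angles it forms $B = R Q$ and records how far $B B'$ deviates from $\Sigma$.

```python
import math


def cholesky_2x2(s11, s12, s22):
    """Lower triangular Cholesky factor R of a 2 x 2 positive definite matrix."""
    r11 = math.sqrt(s11)
    r21 = s12 / r11
    return [[r11, 0.0], [r21, math.sqrt(s22 - r21 ** 2)]]


def rotation(theta):
    """An orthogonal 2 x 2 matrix: rotation by the angle theta."""
    c, s = math.cos(theta), math.sin(theta)
    return [[c, -s], [s, c]]


def times(A, B):
    """Product of two 2 x 2 matrices."""
    return [[sum(A[i][k] * B[k][j] for k in range(2)) for j in range(2)] for i in range(2)]


def gram(B):
    """B B' for a 2 x 2 matrix B."""
    return [[sum(B[i][k] * B[j][k] for k in range(2)) for j in range(2)] for i in range(2)]


sigma = [[2.0, 0.5], [0.5, 1.0]]                    # illustrative covariance matrix
R = cholesky_2x2(2.0, 0.5, 1.0)
factorizations = [times(R, rotation(t)) for t in (0.0, 0.7, 2.5)]
max_error = max(abs(gram(B)[i][j] - sigma[i][j])
                for B in factorizations for i in range(2) for j in range(2))
```

Every element of `factorizations` is a different structural factorization of the same $\Sigma$, so `max_error` should be zero up to floating-point error.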

To make the presentation more transparent, we focus on sign restrictions only and
disregard zero restrictions. Arias et al. (2014) show how sign and zero restrictions
can be treated simultaneously. Thus, the entries of $\{S_h\}$ are elements of $\{-1, +1, \#\}$
only. Assume that a VAR allows sign patterns $\{S_h\}$, $h = 0, 1, \dots, h_{\max}$. Then
according to Definition 15.2, there exists $B \in \mathcal{B}(\Sigma)$ such that $\Psi_h B \in \mathcal{S}(S_h)$ for all
$h = 0, 1, \dots, h_{\max}$. As the (strict) inequality restrictions delineate an open subset of
$\mathcal{B}(\Sigma)$, there exist other nearby matrices which also fulfill the sign restrictions.
Sign restrictions therefore do not identify one impulse response sequence, but a
whole set. Thus, the impulse responses are called set identified.

This set is usually difficult to characterize algebraically so that one has to rely on
computer simulations. Conditional on the estimated VAR, thus conditional on $\{\Psi_j\}$
and $\Sigma$, Lemma 15.1 suggests a simple and straightforward algorithm (see
Rubio-Ramírez et al. 2010 and Arias et al. 2014 for further details):

Step 1: Draw at random an element $Q$ from the uniform distribution on the set of
orthogonal matrices $O(n)$.

Step 2: Convert $Q$ into a random element of $\mathcal{B}(\Sigma)$ by applying $\varphi_\Sigma$ to $Q$. As $\varphi_\Sigma$ is
a homeomorphism this introduces a uniform distribution on $\mathcal{B}(\Sigma)$.

Step 3: Compute the impulse responses with respect to $\varphi_\Sigma(Q)$, i.e. compute
$\Psi_h \varphi_\Sigma(Q)$ for $h = 0, 1, \dots, h_{\max}$.

Step 4: Keep those models with impulse response functions which satisfy the
prescribed sign restrictions $\Psi_h \varphi_\Sigma(Q) \in \mathcal{S}(S_h)$, $h = 0, 1, \dots, h_{\max}$.

Step 5: Repeat steps 1-4 until a satisfactory number of feasible structural models
with impulse responses obeying the sign restrictions have been obtained.

The implementation of this algorithm requires a way to generate random draws
$Q$ from the uniform distribution on $O(n)$.29 This is not a straightforward task
because the elements of $Q$ are interdependent as they must ensure the orthonormality
of the columns of $Q$. Edelman and Rao (2005) propose the following efficient
two-step procedure. First, draw $n \times n$ matrices $X$ such that $X \sim N(0, I_n \otimes I_n)$, i.e. each
element of $X$ is drawn independently from a standard normal distribution. Second,
perform the QR-decomposition which factorizes each matrix $X$ into the product of
an orthogonal matrix $Q$ and an upper triangular matrix $R$ normalized to have positive
diagonal entries.30

29 It can be shown that this probability measure is the unique measure $\mu$ on $O(n)$ which satisfies
the normalization $\mu(O(n)) = 1$ and the (left-)invariance property $\mu(Q \mathcal{Q}) = \mu(\mathcal{Q})$ for every
$\mathcal{Q} \subset O(n)$ and $Q \in O(n)$. In economics, this probability measure is often wrongly referred
to as the Haar measure. The Haar measure is not normalized and is, thus, unique only up to a
proportionality factor.
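
Steps 1 to 5 together with the two-step draw of $Q$ can be sketched as follows. This is an illustrative implementation under stated assumptions, not code from the literature cited above: all function names and numerical inputs are made up, the unspecified sign is written as `"#"`, and the QR decomposition is carried out by Gram-Schmidt orthonormalization of the columns of $X$, which yields the factor $Q$ belonging to an $R$ with positive diagonal entries.

```python
import math
import random


def matmul(A, B):
    """Product of two matrices stored as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


def cholesky(S):
    """Lower triangular R with S = R R' (S symmetric positive definite)."""
    n = len(S)
    R = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = S[i][j] - sum(R[i][k] * R[j][k] for k in range(j))
            R[i][j] = math.sqrt(s) if i == j else s / R[j][j]
    return R


def random_orthogonal(n, rng):
    """Step 1: orthonormalize the columns of an n x n standard normal matrix."""
    cols = []
    for _ in range(n):
        v = [rng.gauss(0.0, 1.0) for _ in range(n)]
        for q in cols:  # remove the components along the earlier columns
            proj = sum(a * b for a, b in zip(q, v))
            v = [a - proj * b for a, b in zip(v, q)]
        norm = math.sqrt(sum(a * a for a in v))
        cols.append([a / norm for a in v])
    return [list(row) for row in zip(*cols)]  # stored columns -> rows of Q


def satisfies(M, S):
    """True if every restricted entry of M has the sign prescribed by S ('#' = free)."""
    return all(s == "#" or (m > 0) - (m < 0) == s
               for rm, rs in zip(M, S) for m, s in zip(rm, rs))


def draw_admissible(Sigma, Psi, patterns, n_keep, rng, max_draws=100_000):
    """Steps 1-5: keep factorizations B = R Q whose responses Psi_h B obey S_h."""
    R = cholesky(Sigma)
    kept = []
    for _ in range(max_draws):
        if len(kept) == n_keep:
            break
        B = matmul(R, random_orthogonal(len(Sigma), rng))                    # steps 1 and 2
        if all(satisfies(matmul(P, B), S) for P, S in zip(Psi, patterns)):   # steps 3 and 4
            kept.append(B)
    return kept


# Illustrative inputs only (not estimates from the text).
Sigma = [[2.0, 0.5], [0.5, 1.0]]
Psi = [[[1.0, 0.0], [0.0, 1.0]]]      # Psi_0 = I_2, i.e. h_max = 0
patterns = [[[1, 1], [-1, 1]]]        # required signs of the impact matrix B
kept = draw_admissible(Sigma, Psi, patterns, n_keep=50, rng=random.Random(0))
```

The cap `max_draws` is a practical safeguard added here: if the prescribed pattern is incompatible with $\Sigma$, no draw is ever accepted and the loop would otherwise not terminate.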

As the impulse responses are only set identified, the way to report the results and
how to conduct inference becomes a matter of discussion. Several methods have
been proposed in the literature:

(i)  One straightforward possibility consists in reporting, for each horizon $h$, the
median of the impulse responses. Although simple to compute, this method
has some disadvantages. The median responses will not correspond to any
of the structural models. Moreover, the orthogonality of the structural shocks
will be lost. Fry and Pagan (2011) propose the median-target method as an
ad hoc remedy for this shortcoming. They advocate searching for the admissible
structural model whose impulse responses come closest to the median ones.

(ii)  Another possibility is to search for the admissible structural model which
maximizes the share of the forecast error variance at some horizon of a given
variable after a particular shock.31 An early application of this method can be
found in Faust (1998). This method remains, however, uninformative about the
relative explanatory power of alternative admissible structural models.

(iii)  The penalty function approach by Mountford and Uhlig (2009) does not
accept or reject particular impulse responses depending on whether they are in
accordance with the sign restrictions (see step 4 in the above algorithm).
Instead, it assigns to each possible impulse response function and every
sign restriction a value which rewards a “correct” sign and penalizes a “wrong”
sign. Mountford and Uhlig (2009) propose the following ad hoc penalty
function: $f(x) = 100x$ if $\operatorname{sgn}(x)$ is wrong and $f(x) = x$ if $\operatorname{sgn}(x)$ is correct. The
impulse response function which minimizes the total (standardized) penalty is
then reported.

(iv)  The exposition becomes more coherent if viewed from a Bayesian perspective.
From this perspective, the uniform distribution on $O(n)$, respectively on $\mathcal{B}(\Sigma)$,
is interpreted as a diffuse or uninformative prior distribution.32 The admissible
structural models which have been retained in step 5 of the algorithm are then
seen as draws from the corresponding posterior distribution. The most likely
model is then given by the model which corresponds to the mode of the posterior
distribution. This model is associated with an impulse response function which
can then be reported. This method also allows the construction of $100(1 - \alpha)$-%
highest posterior density credible sets (see Inoue and Kilian 2013 for details).
As shown by Moon and Schorfheide (2012), these sets cannot, even in a large
sample context, be interpreted as approximate frequentist confidence intervals.
Recently, however, Moon et al. (2013) proposed a frequentist approach to the
construction of error bands for sign identified impulse responses.

30 Given a value of n, the corresponding MATLAB commands are [Q,R] = qr(randn(n));
Q = Q*diag(sign(diag(R))); (see Edelman and Rao 2005).

31 The minimization of the forecast error variance share has also been applied as an identification
device outside the realm of sign restrictions. See Sect. 15.4.2.

32 Whether this distribution is always the “natural” choice in economics has recently been disputed
by Baumeister and Hamilton (2015).


16


As already mentioned in Chap. 7, many raw economic time series are nonstationary
and become stationary only after some transformation. The most common of these
transformations is the formation of differences, perhaps after having taken logs. In
most cases first differences are sufficient to achieve stationarity. The stationarized
series can then be analyzed in the context of VAR models as explained in the
previous chapters. However, many economic theories are formalized in terms of
the original series so that we may want to use the VAR methodology to infer also
the behavior of the untransformed series. Yet, by taking first differences we
probably lose important information on the levels. Thus, it seems worthwhile to develop
an approach which allows us to take the information on the levels into account and at
the same time take care of the nonstationary character of the variables. The concept
of cointegration tries to meet this double requirement.

In the following we will focus our analysis on variables which are integrated
of order one, i.e. on time series which become stationary after having taken first
differences. However, as we have already mentioned in Sect. 7.5.1, a regression
between integrated variables may lead to spurious correlations which make statistical
inferences and interpretations of the estimated coefficients a delicate issue (see
Sect. 7.5.3). A way out of this dilemma is presented by the theory of cointegrated
processes. Loosely speaking, a multivariate process is cointegrated if there exists
a linear combination of the processes which is stationary although each process
taken individually may be integrated. In many cases, this linear combination can be
directly related to economic theory, which has made the analysis of cointegrated
processes an important research topic. In the bivariate case, already dealt
with in Sect. 7.5.2, the cointegrating relation can be immediately read off from the
cointegrating regression and the cointegration test boils down to a unit root test for
the residuals of the cointegrating regression. However, if more than two variables
are involved, the single equation residual based test is, as explained in Sect. 7.5.2,
no longer satisfactory. Thus, a genuine multivariate approach is desirable.

The concept of cointegration goes back to the work of Engle and Granger
(1987) which is itself based on the precursor study of Davidson et al. (1978). In
the meantime the literature has grown tremendously. Good introductions can be
found in Banerjee et al. (1993), Watson (1994) or Lütkepohl (2006). For the more
statistically inclined reader Johansen (1995) is a good reference.


16.1 A Theoretical Example

Before we present the general theory of cointegration within the VAR context, it is instructive to introduce the concept in the well-known class of present discounted value models. These models relate some variable $X_t$ to the present discounted value of another variable $Y_t$:

$$X_t = \gamma(1-\beta)\sum_{j=0}^{\infty}\beta^j P_t Y_{t+j} + u_t, \qquad 0 < \beta < 1,$$

where $u_t \sim \mathrm{WN}(0,\sigma_u^2)$ designates a preference shock. Thereby, $\beta$ denotes the subjective discount factor and $\gamma$ is some unspecified parameter. The present discounted value model states that the variable $X_t$ is proportional to the sum of future $Y_{t+j}$, $j = 0,1,2,\ldots$, discounted by the factor $\beta$. We can interpret $X_t$ and $Y_t$ as the price and the dividend of a share, as the interest rate on long- and short-term bonds, or as consumption and income. In order to operationalize this model, we will assume that forecasts are computed as linear mean-squared error forecasts. The corresponding forecast operator is denoted by $P_t$. Furthermore, we will assume that the forecaster observes $Y_t$ and its past $Y_{t-1}, Y_{t-2}, \ldots$ The goal of the analysis is the investigation of the properties of the bivariate process $\{(X_t, Y_t)'\}$. The analysis of this important class of models presented below is based on Campbell and Shiller (1987).¹

¹ A more recent interesting application of this model is given by the work of Beaudry and Portier (2006).

The model is closed by assuming some specific time series model for $\{Y_t\}$. In this example, we will assume that $\{Y_t\}$ is an integrated process of order one (see Definition 7.1 in Sect. 7.1) such that $\{\Delta Y_t\}$ follows an AR(1) process:

$$\Delta Y_t = \mu(1-\phi) + \phi\,\Delta Y_{t-1} + v_t, \qquad |\phi| < 1 \text{ and } v_t \sim \mathrm{WN}(0,\sigma_v^2).$$

This specification of the $\{Y_t\}$ process implies that $P_t\Delta Y_{t+h} = \mu(1-\phi^h) + \phi^h\Delta Y_t$. Because $P_tY_{t+h} = P_t\Delta Y_{t+h} + \ldots + P_t\Delta Y_{t+1} + Y_t$, $h = 0,1,2,\ldots$, the present discounted value model can be manipulated to give:

$$\begin{aligned}
X_t &= \gamma(1-\beta)\,[\,Y_t + \beta P_tY_{t+1} + \beta^2P_tY_{t+2} + \ldots\,] + u_t\\
&= \gamma(1-\beta)\,[\,Y_t\\
&\qquad + \beta Y_t + \beta P_t\Delta Y_{t+1}\\
&\qquad + \beta^2Y_t + \beta^2P_t\Delta Y_{t+1} + \beta^2P_t\Delta Y_{t+2}\\
&\qquad + \beta^3Y_t + \beta^3P_t\Delta Y_{t+1} + \beta^3P_t\Delta Y_{t+2} + \beta^3P_t\Delta Y_{t+3}\\
&\qquad + \ldots\,] + u_t\\
&= \gamma(1-\beta)\left[\frac{1}{1-\beta}Y_t + \frac{\beta}{1-\beta}P_t\Delta Y_{t+1} + \frac{\beta^2}{1-\beta}P_t\Delta Y_{t+2} + \ldots\right] + u_t.
\end{aligned}$$


This expression shows that the integratedness of $\{Y_t\}$ is transferred to $\{X_t\}$. Bringing $Y_t$ to the left we get the following expression:

$$S_t = X_t - \gamma Y_t = \gamma\sum_{j=1}^{\infty}\beta^j P_t\Delta Y_{t+j} + u_t.$$


The variable $S_t$ is occasionally referred to as the spread. If $\gamma$ is greater than zero, expected increases in $\Delta Y_{t+j}$, $j \ge 1$, have a positive impact on the spread today. For $\gamma = 1$, $S_t$ can denote the log of the price-dividend ratio of a share, or minus the logged savings ratio as in the permanent income model of Campbell (1987). If investors expect a positive (negative) change in the dividends tomorrow, they want to buy (sell) the share, thereby increasing (decreasing) its price already today. In the context of the permanent income hypothesis, expected positive income changes lead to a reduction in today's saving rate. If, on the contrary, households expect negative income changes to occur in the future, they will save already today ("saving for the rainy days").

Inserting for $P_t\Delta Y_{t+j}$, $j = 0,1,\ldots$, the corresponding forecast equation $P_t\Delta Y_{t+h} = \mu(1-\phi^h) + \phi^h\Delta Y_t$, we get:

$$S_t = \frac{\beta\gamma\mu(1-\phi)}{(1-\beta)(1-\beta\phi)} + \frac{\beta\gamma\phi}{1-\beta\phi}\,\Delta Y_t + u_t.$$

The remarkable feature about this relation is that $\{S_t\}$ is a stationary process because both $\{\Delta Y_t\}$ and $\{u_t\}$ are stationary, despite the fact that $\{Y_t\}$ and $\{X_t\}$ are both integrated processes of order one. The mean of $S_t$ is:

$$\mathbb{E}S_t = \frac{\beta\gamma\mu}{1-\beta}.$$


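From the relation between $S_t$ and $\Delta Y_t$ and the AR(1) representation of $\{\Delta Y_t\}$ we can deduce a VAR representation of the joint process $\{(S_t, \Delta Y_t)'\}$:

$$\begin{pmatrix} S_t \\ \Delta Y_t\end{pmatrix} = \mu(1-\phi)\begin{pmatrix}\frac{\beta\gamma}{(1-\beta)(1-\beta\phi)} + \frac{\beta\gamma\phi}{1-\beta\phi} \\ 1\end{pmatrix} + \begin{pmatrix} 0 & \frac{\beta\gamma\phi^2}{1-\beta\phi} \\ 0 & \phi\end{pmatrix}\begin{pmatrix} S_{t-1} \\ \Delta Y_{t-1}\end{pmatrix} + \begin{pmatrix} u_t + \frac{\beta\gamma\phi}{1-\beta\phi}v_t \\ v_t\end{pmatrix}.$$

Further algebraic transformations lead to a VAR representation of order two for the level variables $\{(X_t, Y_t)'\}$:

$$\begin{aligned}
\begin{pmatrix}X_t\\Y_t\end{pmatrix} &= c + \Phi_1\begin{pmatrix}X_{t-1}\\Y_{t-1}\end{pmatrix} + \Phi_2\begin{pmatrix}X_{t-2}\\Y_{t-2}\end{pmatrix} + Z_t\\
&= \mu(1-\phi)\begin{pmatrix}\frac{\gamma}{(1-\beta)(1-\beta\phi)}\\1\end{pmatrix} + \begin{pmatrix}0 & \gamma + \frac{\gamma\phi}{1-\beta\phi}\\0 & 1+\phi\end{pmatrix}\begin{pmatrix}X_{t-1}\\Y_{t-1}\end{pmatrix} + \begin{pmatrix}0 & -\frac{\gamma\phi}{1-\beta\phi}\\0 & -\phi\end{pmatrix}\begin{pmatrix}X_{t-2}\\Y_{t-2}\end{pmatrix} + \begin{pmatrix}u_t + \frac{\gamma}{1-\beta\phi}v_t\\v_t\end{pmatrix}.
\end{aligned}$$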


Next we want to check whether this stochastic difference equation possesses a stationary solution. For this purpose, we must locate the roots of the equation $\det\Phi(z) = \det(I_2 - \Phi_1z - \Phi_2z^2) = 0$ (see Theorem 12.1). As

$$\det\Phi(z) = \det\begin{pmatrix}1 & -\left(\gamma + \frac{\gamma\phi}{1-\beta\phi}\right)z + \frac{\gamma\phi}{1-\beta\phi}z^2 \\ 0 & 1-(1+\phi)z+\phi z^2\end{pmatrix} = 1 - (1+\phi)z + \phi z^2 = (1-z)(1-\phi z),$$

the roots are $z_1 = 1/\phi$ and $z_2 = 1$. Thus, only the root $z_1$ lies outside the unit circle whereas the root $z_2$ lies on the unit circle. The existence of a unit root precludes the existence of a stationary solution. Note that we have just one unit root, although each of the two processes taken by itself is integrated of order one.
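The location of the two roots can be checked numerically. The following minimal sketch assumes the parameter values quoted for Fig. 16.1 ($\gamma = 1$, $\beta = 0.9$, $\phi = 0.8$). The matrices `Phi1` and `Phi2` are the level VAR(2) coefficients obtained by substituting $Y_t = Y_{t-1} + \Delta Y_t$ and $X_t = S_t + \gamma Y_t$ into the equations above; all variable names are illustrative only.

```python
import numpy as np

# Parameter values as quoted for Fig. 16.1 (assumed here for illustration).
gamma, beta, phi = 1.0, 0.9, 0.8
k = gamma * phi / (1.0 - beta * phi)

# Level VAR(2): (X_t, Y_t)' = c + Phi1 (X_{t-1}, Y_{t-1})' + Phi2 (X_{t-2}, Y_{t-2})' + Z_t
Phi1 = np.array([[0.0, gamma + k],
                 [0.0, 1.0 + phi]])
Phi2 = np.array([[0.0, -k],
                 [0.0, -phi]])


def det_Phi(z):
    """Determinant of Phi(z) = I - Phi1 z - Phi2 z^2."""
    return np.linalg.det(np.eye(2) - Phi1 * z - Phi2 * z**2)


# det Phi(z) = 1 - (1 + phi) z + phi z^2; np.roots expects the highest power first.
roots = np.sort(np.roots([phi, -(1.0 + phi), 1.0]).real)

print("roots of det Phi(z):", roots)
print("det Phi(z) evaluated at the roots:", [det_Phi(z) for z in roots])
```

One root equals one and the other equals $1/\phi$, in line with the factorization $(1-z)(1-\phi z)$.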

The above VAR representation can be further transformed to yield a representation of the process in first differences $\{(\Delta X_t, \Delta Y_t)'\}$:

$$\begin{pmatrix}\Delta X_t\\ \Delta Y_t\end{pmatrix} = \mu(1-\phi)\begin{pmatrix}\frac{\gamma}{(1-\beta)(1-\beta\phi)}\\1\end{pmatrix} - \begin{pmatrix}1&-\gamma\\0&0\end{pmatrix}\begin{pmatrix}X_{t-1}\\Y_{t-1}\end{pmatrix} + \begin{pmatrix}0&\frac{\gamma\phi}{1-\beta\phi}\\0&\phi\end{pmatrix}\begin{pmatrix}\Delta X_{t-1}\\ \Delta Y_{t-1}\end{pmatrix} + \begin{pmatrix}u_t+\frac{\gamma}{1-\beta\phi}v_t\\v_t\end{pmatrix}.$$

This representation can be considered as a generalization of the Dickey-Fuller regression in first difference form (see Eq. (7.1)). In the multivariate case, it is known as the vector error correction model (VECM) or vector error correction representation. In this representation the matrix

$$\Pi = -\Phi(1) = \begin{pmatrix}-1&\gamma\\0&0\end{pmatrix}$$

is of special importance. This matrix is singular and of rank one. This is not an implication which is special to this specification of the present discounted value model, but arises generally as shown in Campbell (1987) and Campbell and Shiller (1987). In the VECM representation all variables except $(X_{t-1}, Y_{t-1})'$ are stationary by construction. This implies that $-\Pi(X_{t-1}, Y_{t-1})'$ must be stationary too, despite the fact that $\{(X_t, Y_t)'\}$ is not stationary as shown above. Multiplying $-\Pi(X_{t-1}, Y_{t-1})'$ out, one obtains two linear combinations which define stationary processes. However, as $\Pi$ has only rank one, there is just one linearly independent combination. The first one is $X_{t-1} - \gamma Y_{t-1}$ and equals $S_{t-1}$ which was already shown to be stationary. The second one is degenerate because it yields zero. This phenomenon is called cointegration.

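Because $\Pi$ has rank one, it can be written as the product of two vectors $\alpha$ and $\beta$:

$$\Pi = \alpha\beta' = \begin{pmatrix}-1\\0\end{pmatrix}\begin{pmatrix}1\\-\gamma\end{pmatrix}'.$$

Clearly, this decomposition of $\Pi$ is not unique because $\tilde\alpha = a\alpha$ and $\tilde\beta = a^{-1}\beta$, $a \neq 0$, would also qualify for such a decomposition as $\Pi = \tilde\alpha\tilde\beta'$. The vector $\beta$ is called a cointegration vector (it is not to be confused with the discount factor, for which the same symbol is used above). It has the property that $\{\beta'(X_t, Y_t)'\}$ defines a stationary process despite the fact that $\{(X_t, Y_t)'\}$ is non-stationary. The cointegration vector thus defines a linear combination of $X_t$ and $Y_t$ which is stationary. The matrix $\alpha$, here only a vector, is called the loading matrix and its elements the loading coefficients.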
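The rank-one structure of $\Pi$ and the non-uniqueness of its decomposition can be illustrated numerically. This is a minimal sketch under the parameter values quoted for Fig. 16.1; the name `coint` stands for the cointegration vector (to avoid a clash with the discount factor `beta`), and the scalar `a = 2.5` is an arbitrary choice.

```python
import numpy as np

gamma, beta, phi = 1.0, 0.9, 0.8           # illustrative values (as quoted for Fig. 16.1)
k = gamma * phi / (1.0 - beta * phi)
Phi1 = np.array([[0.0, gamma + k], [0.0, 1.0 + phi]])
Phi2 = np.array([[0.0, -k], [0.0, -phi]])

# Pi = -Phi(1) = Phi1 + Phi2 - I
Pi = Phi1 + Phi2 - np.eye(2)
rank_Pi = np.linalg.matrix_rank(Pi, tol=1e-10)

alpha = np.array([[-1.0], [0.0]])          # loading vector
coint = np.array([[1.0], [-gamma]])        # cointegration vector (beta in the text)

# The decomposition is only determined up to a scalar a != 0.
a = 2.5
alpha_tilde, coint_tilde = a * alpha, coint / a

print("Pi =\n", Pi, "\nrank:", rank_Pi)
print("Pi == alpha coint':", np.allclose(Pi, alpha @ coint.T))
print("Pi == alpha~ coint~':", np.allclose(Pi, alpha_tilde @ coint_tilde.T))
```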

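The VAR and the VECM representations are both well suited for estimation. However, if we want to compute the impulse responses, we need a causal representation. Such a causal representation does not exist due to the unit root in the VAR process for $\{(X_t, Y_t)'\}$ (see Theorem 12.1). To circumvent this problem we split the matrix $\Phi(z)$ into the product of two matrices $M(z)$ and $V(z)$. $M(z)$ is a diagonal matrix which encompasses all unit roots on its diagonal. $V(z)$ has all its roots outside the unit circle so that $V^{-1}(z)$ exists for $|z| \le 1$. In our example, we get:

$$\Phi(z) = \begin{pmatrix}1&0\\0&1-z\end{pmatrix}\begin{pmatrix}1 & -\left(\gamma+\frac{\gamma\phi}{1-\beta\phi}\right)z + \frac{\gamma\phi}{1-\beta\phi}z^2\\0&1-\phi z\end{pmatrix} = M(z)V(z).$$

Multiplying $\Phi(z)$ from the left by $\widetilde{M}(z) = \begin{pmatrix}1-z&0\\0&1\end{pmatrix}$, we find:

$$\widetilde{M}(z)\Phi(z) = \widetilde{M}(z)M(z)V(z) = (1-z)I_2V(z) = (1-z)V(z).$$

The application of this result to the VAR representation of $\{(X_t, Y_t)'\}$ leads to a causal representation of $\{(\Delta X_t, \Delta Y_t)'\}$:

$$\begin{aligned}
\widetilde{M}(L)\Phi(L)\begin{pmatrix}X_t\\Y_t\end{pmatrix} &= (1-L)V(L)\begin{pmatrix}X_t\\Y_t\end{pmatrix} = V(L)\begin{pmatrix}\Delta X_t\\ \Delta Y_t\end{pmatrix} = \widetilde{M}(L)(c + Z_t) = \begin{pmatrix}0\\ \mu(1-\phi)\end{pmatrix} + \widetilde{M}(L)Z_t,\\
\begin{pmatrix}\Delta X_t\\ \Delta Y_t\end{pmatrix} &= V^{-1}(1)\begin{pmatrix}0\\ \mu(1-\phi)\end{pmatrix} + V^{-1}(L)\widetilde{M}(L)Z_t = \mu\begin{pmatrix}\gamma\\1\end{pmatrix} + \Psi(L)Z_t.
\end{aligned}$$

The polynomial matrix $\Psi(L)$ can be recovered by the method of undetermined coefficients from the relation between $V(L)$ and $\Psi(L)$:

$$V(L)\Psi(L) = \widetilde{M}(L) = \begin{pmatrix}1-L&0\\0&1\end{pmatrix}.$$

In this exposition, we abstain from the explicit computation of $V^{-1}(L)$ and $\Psi(L)$. However, the following relation holds:

$$V(1) = \begin{pmatrix}1&-\gamma\\0&1-\phi\end{pmatrix} \quad\Longrightarrow\quad V^{-1}(1) = \begin{pmatrix}1&\frac{\gamma}{1-\phi}\\0&\frac{1}{1-\phi}\end{pmatrix},$$

implying that

$$\Psi(1) = V^{-1}(1)\widetilde{M}(1) = \frac{1}{1-\phi}\begin{pmatrix}0&\gamma\\0&1\end{pmatrix}.$$

The cointegration vector $\beta = (1, -\gamma)'$ and loading matrix $\alpha = (-1, 0)'$ therefore have the following properties:

$$\beta'\Psi(1) = \begin{pmatrix}0&0\end{pmatrix} \quad\text{and}\quad \Psi(1)\alpha = \begin{pmatrix}0\\0\end{pmatrix}.$$

Like in the univariate case (see Theorem 7.1 in Sect. 7.1.4), we can also construct the Beveridge-Nelson decomposition in the multivariate case. For this purpose, we decompose $\Psi(L)$ as follows:

$$\Psi(L) = \Psi(1) + (L-1)\widetilde{\Psi}(L)$$

with $\widetilde{\Psi}_j = \sum_{i=j+1}^{\infty}\Psi_i$. This result can be used to derive the multivariate Beveridge-Nelson decomposition (see Theorem 16.1 in Sect. 16.2.3):

$$\begin{aligned}
\begin{pmatrix}X_t\\Y_t\end{pmatrix} &= \begin{pmatrix}X_0\\Y_0\end{pmatrix} + \mu\begin{pmatrix}\gamma\\1\end{pmatrix}t + \Psi(1)\sum_{j=1}^{t}Z_j + \text{stationary process}\\
&= \begin{pmatrix}X_0\\Y_0\end{pmatrix} + \mu\begin{pmatrix}\gamma\\1\end{pmatrix}t + \frac{1}{1-\phi}\begin{pmatrix}0&\gamma\\0&1\end{pmatrix}\sum_{j=1}^{t}\begin{pmatrix}u_j+\frac{\gamma}{1-\beta\phi}v_j\\v_j\end{pmatrix} + \text{stationary process}\\
&= \begin{pmatrix}X_0\\Y_0\end{pmatrix} + \mu\begin{pmatrix}\gamma\\1\end{pmatrix}t + \frac{1}{1-\phi}\begin{pmatrix}\gamma\\1\end{pmatrix}\sum_{j=1}^{t}v_j + \text{stationary process.}
\end{aligned} \qquad (16.1)$$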


The Beveridge-Nelson decomposition represents the bivariate integrated process $\{(X_t, Y_t)'\}$ as a sum of three components: a linear trend, a multivariate random walk and a stationary process. Multiplying the Beveridge-Nelson decomposition from the left by the cointegration vector $\beta = (1, -\gamma)'$, we see that both the trend and the random walk component are eliminated and that only the stationary component remains.

Because the first column of $\Psi(1)$ consists of zeros, only the second structural shock, namely $\{v_t\}$, will have a long-run (permanent) effect. The long-run effect is $\gamma/(1-\phi)$ for the first variable, $X_t$, and $1/(1-\phi)$ for the second variable, $Y_t$. The first structural shock (preference shock) $\{u_t\}$ has no long-run effect; its impact is of a transitory nature only. This decomposition into permanent and transitory shocks is not specific to this model, but can be done in general as part of the so-called common trend representation (see Sect. 16.2.4).
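The long-run matrix and its annihilation by the cointegration and loading vectors can be verified with a few lines of linear algebra. The sketch below assumes $\gamma = 1$ and $\phi = 0.8$ (the values quoted for Fig. 16.1) and takes $\widetilde{M}(z) = \mathrm{diag}(1-z, 1)$ as in the example; `coint` denotes the cointegration vector.

```python
import numpy as np

gamma, phi = 1.0, 0.8                      # illustrative values (as quoted for Fig. 16.1)

V1 = np.array([[1.0, -gamma],
               [0.0, 1.0 - phi]])          # V(1)
M_tilde1 = np.diag([0.0, 1.0])             # M~(z) = diag(1 - z, 1) evaluated at z = 1
Psi1 = np.linalg.inv(V1) @ M_tilde1        # Psi(1) = V(1)^{-1} M~(1)

coint = np.array([1.0, -gamma])            # cointegration vector (beta in the text)
alpha = np.array([-1.0, 0.0])              # loading vector

print("Psi(1) =\n", Psi1)
print("beta' Psi(1) =", coint @ Psi1)
print("Psi(1) alpha =", Psi1 @ alpha)
```

The second column of `Psi1` contains the long-run effects $\gamma/(1-\phi)$ and $1/(1-\phi)$, while the first column is zero.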

Finally, we will simulate the reaction of the system to a unit valued shock in $v_t$. Although this shock only has a temporary influence on $\Delta Y_t$, it will have a permanent effect on the level $Y_t$. Taking $\phi = 0.8$, we get a long-run effect (persistence) of $1/(1-\phi) = 5$ as explained in Sect. 7.1.3. The present discounted value model then implies that this shock will also have a permanent effect on $X_t$. Setting $\gamma = 1$, this long-run effect is given by $\gamma(1-\beta)\sum_{j=0}^{\infty}\beta^j(1-\phi)^{-1} = \gamma/(1-\phi) = 5$. Because this long-run effect is anticipated in period $t$, the period of the occurrence of the shock, $X_t$ will increase by more than one. The spread therefore turns positive. The error correction mechanism will then dampen the effect on future changes of $X_t$ so that the spread will return steadily to zero. The corresponding impulse responses of both variables are displayed in Fig. 16.1.

Fig. 16.1 Impulse response functions of the present discounted value model after a unit shock to $Y_t$ ($\gamma = 1$, $\beta = 0.9$, $\phi = 0.8$)
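The impulse responses described above can be reproduced, at least qualitatively, by iterating on the level VAR(2). The sketch assumes the parameter values of Fig. 16.1 and the coefficient matrices implied by the model equations; the horizon `H = 60` is an arbitrary choice, and the code is an illustration rather than the routine used to produce the figure.

```python
import numpy as np

gamma, beta, phi = 1.0, 0.9, 0.8           # values stated for Fig. 16.1
k = gamma * phi / (1.0 - beta * phi)
Phi1 = np.array([[0.0, gamma + k], [0.0, 1.0 + phi]])
Phi2 = np.array([[0.0, -k], [0.0, -phi]])

# A unit shock in v_t enters the VAR innovation Z_t = (u_t + gamma/(1 - beta*phi) v_t, v_t)'.
impact = np.array([gamma / (1.0 - beta * phi), 1.0])

H = 60                                      # horizon (arbitrary)
Theta_prev, Theta = np.zeros((2, 2)), np.eye(2)   # Theta_{-1} and Theta_0
irf = np.empty((H + 1, 2))
for h in range(H + 1):
    irf[h] = Theta @ impact                 # response of (X, Y) after h periods
    # Theta_{h+1} = Phi1 Theta_h + Phi2 Theta_{h-1}
    Theta_prev, Theta = Theta, Phi1 @ Theta + Phi2 @ Theta_prev

spread = irf[:, 0] - gamma * irf[:, 1]      # response of S = X - gamma * Y

print("responses of (X, Y) at h = 0, 1, 5, H:\n", irf[[0, 1, 5, H]])
print("spread response at h = 0, 1, 5, H:", spread[[0, 1, 5, H]])
```

On impact $X_t$ rises by $\gamma/(1-\beta\phi) > 1$, both level responses approach $5$, and the spread response decays geometrically at rate $\phi$.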

Figure 16.2 displays the trajectories of both variables after a stochastic simulation where both shocks $\{u_t\}$ and $\{v_t\}$ are drawn from a standard normal distribution. One can clearly discern the non-stationary character of both series. However, as is typical for cointegrated series, they move more or less in parallel to each other. This parallel movement is ensured by the error correction mechanism. The difference between both series, which is equal to the spread under this parameter constellation, is mean reverting around zero.
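A stochastic simulation along these lines might look as follows. This is a sketch only: the seed, the sample size `T = 500`, the zero initial conditions, and the choice $\mu = 0$ are assumptions that are not stated in the text.

```python
import numpy as np

rng = np.random.default_rng(0)              # seed chosen arbitrarily
gamma, beta, phi, mu = 1.0, 0.9, 0.8, 0.0   # mu = 0 is an assumption of this sketch
T = 500
u = rng.standard_normal(T)                  # preference shocks u_t
v = rng.standard_normal(T)                  # shocks v_t to the growth of Y

# Delta Y_t follows an AR(1); Y_t is its cumulated sum and hence integrated of order one.
dY = np.zeros(T)
for t in range(1, T):
    dY[t] = mu * (1 - phi) + phi * dY[t - 1] + v[t]
Y = np.cumsum(dY)

# Spread S_t from the closed-form relation, then X_t = S_t + gamma * Y_t.
coef = beta * gamma * phi / (1 - beta * phi)
S = beta * gamma * mu * (1 - phi) / ((1 - beta) * (1 - beta * phi)) + coef * dY + u
X = S + gamma * Y

print("sample std of Y:", Y.std(), " sample std of the spread:", S.std())
```

The simulated levels wander without reverting, whereas the spread remains bounded in variance; by construction the simulated levels also satisfy the level VAR(2) recursion.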


Fig. 16.2 Stochastic simulation of the present discounted value model under standard normally distributed shocks ($\gamma = 1$, $\beta = 0.9$, $\phi = 0.8$)


16.2 Definition and Representation of Cointegrated Processes
16.2.1 Definition

We now want to make the concepts introduced earlier more precise and give a general definition of cointegrated processes and derive the different representations we have seen in the previous section. Given an arbitrary regular (purely non-deterministic) stationary process $\{U_t\}$ of dimension $n$, $n \ge 1$, with mean zero and some distribution for the starting random variable $X_0$, we can define recursively a process $\{X_t\}$, $t = 0, 1, 2, \ldots$ as follows:

$$X_t = \mu + X_{t-1} + U_t, \qquad t = 1, 2, \ldots$$

Thereby, $\mu$ denotes an arbitrary constant vector of dimension $n$. If $U_t \sim \mathrm{WN}(0, \Sigma)$, then $\{X_t\}$ is a multivariate random walk with drift $\mu$. In general, however, $\{U_t\}$ is autocorrelated and possesses a Wold representation $U_t = \Psi(L)Z_t$ (see Sect. 14.1.1) such that

$$\Delta X_t = \mu + U_t = \mu + \Psi(L)Z_t = \mu + Z_t + \Psi_1Z_{t-1} + \Psi_2Z_{t-2} + \ldots, \qquad (16.2)$$

where $Z_t \sim \mathrm{WN}(0, \Sigma)$ and $\sum_{j=0}^{\infty}\|\Psi_j\|^2 < \infty$ with $\Psi_0 = I_n$. We now introduce the following definitions.

Definition 16.1. A regular stationary process $\{U_t\}$ with mean zero is integrated of order zero, $I(0)$, if and only if it can be represented as

$$U_t = \Psi(L)Z_t = Z_t + \Psi_1Z_{t-1} + \Psi_2Z_{t-2} + \ldots$$

such that $Z_t \sim \mathrm{WN}(0, \Sigma)$, $\sum_{j=0}^{\infty}j\|\Psi_j\| < \infty$, and $\Psi(1) = \sum_{j=0}^{\infty}\Psi_j \ne 0$.

Definition 16.2. A stochastic process $\{X_t\}$ is integrated of order $d$, $I(d)$, $d = 0, 1, 2, \ldots$, if and only if $\Delta^d(X_t - \mathbb{E}(X_t))$ is integrated of order zero.

In the following we concentrate on $I(1)$ processes. The definition of an $I(1)$ process implies that $\{X_t\}$ equals $X_t = X_0 + \mu t + \sum_{j=1}^{t}U_j$ and is thus non-stationary even if $\mu = 0$. The condition $\Psi(1) \ne 0$ corresponds to the one in the univariate case (compare Definition 7.1 in Sect. 7.1). On the one hand, it precludes the case that a trend-stationary process is classified as an integrated process. On the other hand, it implies that $\{X_t\}$ is in fact non-stationary. Indeed, if the condition is violated so that $\Psi(1) = 0$, we could express $\Psi(L)$ as $(1-L)\widetilde{\Psi}(L)$. Thus we could cancel $1-L$ on both sides of Eq. (16.2) to obtain a stationary representation of $\{X_t\}$, given some initial distribution for $X_0$. This would then contradict our original assumption that $\{X_t\}$ is non-stationary. The condition $\sum_{j=0}^{\infty}j\|\Psi_j\| < \infty$ is stronger than $\sum_{j=0}^{\infty}\|\Psi_j\|^2 < \infty$ which follows from Wold's Theorem. It guarantees the existence of the Beveridge-Nelson decomposition (see Theorem 16.1 below).² In particular, the condition is fulfilled if $\{U_t\}$ is a causal ARMA process which is the prototypical case.

² This condition could be relaxed and replaced by the condition $\sum_{j=0}^{\infty}j^2\|\Psi_j\|^2 < \infty$. In addition, this condition is an important assumption for the application of the law of large numbers and for the derivation of the asymptotic distribution (Phillips and Solo 1992).

Like in the univariate case, we can decompose an $I(1)$ process additively into several components.

Theorem 16.1 (Beveridge-Nelson Decomposition). If $\{X_t\}$ is an integrated process of order one, it can be decomposed as

$$X_t = X_0 + \mu t + \Psi(1)\sum_{j=1}^{t}Z_j + V_t,$$

where $V_t = \widetilde{\Psi}(L)Z_0 - \widetilde{\Psi}(L)Z_t$ with $\widetilde{\Psi}_j = \sum_{i=j+1}^{\infty}\Psi_i$, $j = 0, 1, 2, \ldots$ and $\{V_t\}$ stationary.

Proof. Following the proof of the univariate case (see Sect. 7.1.4):

$$\Psi(L) = \Psi(1) + (L-1)\widetilde{\Psi}(L)$$

with $\widetilde{\Psi}_j = \sum_{i=j+1}^{\infty}\Psi_i$. Thus,

$$\begin{aligned}
X_t &= X_0 + \mu t + \sum_{j=1}^{t}U_j = X_0 + \mu t + \sum_{j=1}^{t}\Psi(L)Z_j\\
&= X_0 + \mu t + \sum_{j=1}^{t}\left(\Psi(1) + (L-1)\widetilde{\Psi}(L)\right)Z_j\\
&= X_0 + \mu t + \Psi(1)\sum_{j=1}^{t}Z_j + \sum_{j=1}^{t}(L-1)\widetilde{\Psi}(L)Z_j\\
&= X_0 + \mu t + \Psi(1)\sum_{j=1}^{t}Z_j + \widetilde{\Psi}(L)Z_0 - \widetilde{\Psi}(L)Z_t.
\end{aligned}$$

The only point left is to show that $\widetilde{\Psi}(L)Z_0 - \widetilde{\Psi}(L)Z_t$ is stationary. Based on Theorem 10.2, it is sufficient to show that the coefficient matrices are absolutely summable. This can be derived by applying the triangle inequality and the condition for integrated processes:

$$\sum_{j=0}^{\infty}\|\widetilde{\Psi}_j\| = \sum_{j=0}^{\infty}\Big\|\sum_{i=j+1}^{\infty}\Psi_i\Big\| \le \sum_{j=0}^{\infty}\sum_{i=j+1}^{\infty}\|\Psi_i\| = \sum_{j=1}^{\infty}j\|\Psi_j\| < \infty.$$


□ 
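The decomposition can be verified numerically in the simplest non-trivial case. The sketch below assumes a bivariate MA(1) process $U_t = Z_t + \Psi_1Z_{t-1}$, for which $\Psi(1) = I_n + \Psi_1$, $\widetilde{\Psi}_0 = \Psi_1$ and $\widetilde{\Psi}_j = 0$ for $j \ge 1$. The drift `mu`, the matrix `Psi_1`, the seed and the sample size are hypothetical values chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)              # arbitrary seed
n, T = 2, 200
mu = np.array([0.1, -0.05])                 # hypothetical drift
Psi_1 = np.array([[0.5, 0.2],
                  [-0.3, 0.4]])             # hypothetical MA(1) coefficient matrix
Z = rng.standard_normal((T + 1, n))         # rows hold Z_0, Z_1, ..., Z_T
X0 = np.zeros(n)

# U_t = Z_t + Psi_1 Z_{t-1} for t = 1..T and X_t = X_0 + mu t + sum_{j<=t} U_j
U = Z[1:] + Z[:-1] @ Psi_1.T
t = np.arange(1, T + 1)[:, None]
X = X0 + mu * t + np.cumsum(U, axis=0)

# Beveridge-Nelson pieces for the MA(1) case
Psi_at_1 = np.eye(n) + Psi_1                        # Psi(1)
random_walk = np.cumsum(Z[1:], axis=0) @ Psi_at_1.T # Psi(1) * sum_{j<=t} Z_j
V = Z[0] @ Psi_1.T - Z[1:] @ Psi_1.T                # V_t = Psi~(L) Z_0 - Psi~(L) Z_t

X_bn = X0 + mu * t + random_walk + V
print("max abs reconstruction error:", np.abs(X - X_bn).max())
```

The directly cumulated process and the sum of trend, random walk and stationary component coincide up to floating point error.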


The process $\{X_t\}$ can therefore be viewed as the sum of a linear trend, $X_0 + \mu t$, with stochastic intercept, a multivariate random walk, $\Psi(1)\sum_{j=1}^{t}Z_j$, and a stationary process $\{V_t\}$. Based on this representation, we can then define the notion of cointegration (Engle and Granger 1987).

Definition 16.3 (Cointegration). A multivariate stochastic process $\{X_t\}$ is called cointegrated if $\{X_t\}$ is integrated of order one and if there exists a vector $\beta \in \mathbb{R}^n$, $\beta \ne 0$, such that $\{\beta'X_t\}$ is integrated of order zero, given a corresponding distribution for the initial random variable $X_0$. $\beta$ is called the cointegrating or cointegration vector. The cointegrating rank is the maximal number, $r$, of linearly independent cointegrating vectors $\beta_1, \ldots, \beta_r$. These vectors span a linear space called the cointegration space.

The Beveridge-Nelson decomposition implies that $\beta$ is a cointegrating vector if and only if $\beta'\Psi(1) = 0$. In this case the random walk component $\sum_{j=1}^{t}Z_j$ is annihilated and only the deterministic and the stationary component remain.³ For some issues it is of interest whether the cointegration vector $\beta$ also eliminates the trend component. This would be the case if $\beta'\mu = 0$. See Sect. 16.3 for details.

³ The distribution of $X_0$ is thereby chosen such that $\beta'X_0 =$

The cointegration vectors are determined only up to some basis transformations. If $\beta_1, \ldots, \beta_r$ is a basis for the cointegration space then $(\beta_1, \ldots, \beta_r)R$ is also a basis for the cointegration space for any nonsingular $r \times r$ matrix $R$ because

$$\left((\beta_1, \ldots, \beta_r)R\right)'\Psi(1) = 0.$$


16.2.2 Vector Autoregressive (VAR) and Vector Error Correction Models (VECM)

Although the Beveridge-Nelson decomposition is very useful from a theoretical point of view, in practice it is often more convenient to work with alternative representations. Most empirical investigations of integrated processes start from a VAR(p) model which has the big advantage that it can be easily estimated:

$$X_t = c + \Phi_1X_{t-1} + \ldots + \Phi_pX_{t-p} + Z_t, \qquad Z_t \sim \mathrm{WN}(0, \Sigma) \qquad (16.3)$$

where $\Phi(L) = I_n - \Phi_1L - \ldots - \Phi_pL^p$ and $c$ is an arbitrary constant. Subtracting $X_{t-1}$ on both sides of the difference equation, the VAR model can be rewritten as:

$$\Delta X_t = c + \Pi X_{t-1} + \Gamma_1\Delta X_{t-1} + \ldots + \Gamma_{p-1}\Delta X_{t-p+1} + Z_t \qquad (16.4)$$

where $\Pi = -\Phi(1) = -I_n + \Phi_1 + \ldots + \Phi_p$ and $\Gamma_i = -\sum_{j=i+1}^{p}\Phi_j$. We will make the following assumptions:

(i) All roots of the polynomial $\det\Phi(z)$ are outside the unit circle or equal to one, i.e.

$$\det\Phi(z) = 0 \quad\Longrightarrow\quad |z| > 1 \ \text{ or } \ z = 1.$$

(ii) The matrix $\Pi$ is singular with rank $r$, $1 \le r < n$.

(iii) $\mathrm{Rank}(\Pi) = \mathrm{Rank}(\Pi^2)$.
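For a given set of VAR coefficients, the mapping to $\Pi$ and $\Gamma_i$ and the three assumptions can be checked numerically. The sketch below uses a hypothetical bivariate VAR(2) constructed from $\alpha = (-0.2, 0.3)'$, $\beta = (1, -1)'$ and an arbitrary $\Gamma_1$; the helper `var_to_vecm` is an illustrative name, not a library function. Assumption (i) is examined through the companion matrix, whose eigenvalues are the inverses of the roots of $\det\Phi(z)$.

```python
import numpy as np


def var_to_vecm(Phis):
    """Map VAR(p) coefficients [Phi_1, ..., Phi_p] to (Pi, [Gamma_1, ..., Gamma_{p-1}])."""
    n = Phis[0].shape[0]
    Pi = sum(Phis) - np.eye(n)                               # Pi = -Phi(1)
    Gammas = [-sum(Phis[i:]) for i in range(1, len(Phis))]   # Gamma_i = -(Phi_{i+1} + ... + Phi_p)
    return Pi, Gammas


# Hypothetical VECM ingredients (illustrative numbers only).
alpha = np.array([[-0.2], [0.3]])
coint = np.array([[1.0], [-1.0]])           # cointegration vector (beta in the text)
Gamma1 = np.array([[0.3, 0.0], [0.1, 0.2]])
Phi1 = np.eye(2) + alpha @ coint.T + Gamma1 # implied level VAR(2) coefficients
Phi2 = -Gamma1

Pi, Gammas = var_to_vecm([Phi1, Phi2])

# Assumption (i): inverse roots of det Phi(z) are the companion eigenvalues.
companion = np.block([[Phi1, Phi2], [np.eye(2), np.zeros((2, 2))]])
moduli = np.sort(np.abs(np.linalg.eigvals(companion)))

rank_Pi = np.linalg.matrix_rank(Pi, tol=1e-10)        # assumption (ii)
rank_Pi2 = np.linalg.matrix_rank(Pi @ Pi, tol=1e-10)  # assumption (iii)

print("moduli of inverse roots:", moduli)
print("rank(Pi) =", rank_Pi, " rank(Pi^2) =", rank_Pi2)
```

In this example exactly one inverse root has modulus one ($n - r = 1$) and the remaining ones lie strictly inside the unit circle.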


Assumption (i) makes sure that $\{X_t\}$ is an integrated process with order of integration $d \ge 1$. Moreover, it precludes other roots on the unit circle than one. The case of seasonal unit roots is treated in Hylleberg et al. (1990) and Johansen and Schaumburg (1998).⁴ Assumption (ii) implies that there exist at least $n - r$ unit roots and two $n \times r$ matrices $\alpha$ and $\beta$ with full column rank $r$ such that

$$\Pi = \alpha\beta'.$$

The columns of $\beta$ thereby represent the cointegration vectors whereas $\alpha$ denotes the so-called loading matrix. The decomposition of $\Pi$ in the product of $\alpha$ and $\beta'$ is not unique. For every non-singular $r \times r$ matrix $R$ we can generate an alternative decomposition $\Pi = \alpha\beta' = (\alpha R'^{-1})(\beta R)'$. Finally, assumption (iii) implies that the order of integration is exactly one and not greater. The number of unit roots is therefore exactly $n - r$.⁵ This has the implication that $\Phi(z)$ can be written as

$$\Phi(z) = U(z)M(z)V(z)$$

where the roots of the matrix polynomials $U(z)$ and $V(z)$ are all outside the unit circle and where $M(z)$ equals

$$M(z) = \begin{pmatrix}(1-z)I_{n-r}&0\\0&I_r\end{pmatrix}.$$

This representation of $\Phi(z)$ is a special form of the Smith-McMillan factorization of polynomial matrices (see Kailath (1980) and Yoo (1987)). This factorization isolates the unit roots in one simple matrix so that the system can be analyzed more easily.


⁴ The seasonal unit roots are the roots of $z^s - 1 = 0$ where $s$ denotes the number of seasons. These roots can be expressed as $\cos(2k\pi/s) + i\sin(2k\pi/s)$, $k = 0, 1, \ldots, s-1$.

⁵ For details see Johansen (1995), Neusser (2000) and Bauer and Wagner (2003).

These assumptions will allow us to derive from the VAR(p) model several representations, each of which brings with it a particular interpretation. Replacing $\Pi$ by $\alpha\beta'$ in Eq. (16.4), we obtain the vector error correction representation or vector error correction model (VECM):

$$\Delta X_t = c + \alpha\beta'X_{t-1} + \Gamma_1\Delta X_{t-1} + \ldots + \Gamma_{p-1}\Delta X_{t-p+1} + Z_t. \qquad (16.5)$$

Multiplying both sides of the equation by $(\alpha'\alpha)^{-1}\alpha'$ and solving for $\beta'X_{t-1}$, we get:

$$\beta'X_{t-1} = (\alpha'\alpha)^{-1}\alpha'\Big(\Delta X_t - c - \sum_{j=1}^{p-1}\Gamma_j\Delta X_{t-j} - Z_t\Big).$$

$\alpha$ has full column rank $r$ so that $\alpha'\alpha$ is a non-singular $r \times r$ matrix. As the right hand side of the equation represents a stationary process, the left hand side must be stationary as well. This means that the $r$-dimensional process $\{\beta'X_{t-1}\}$ is stationary despite the fact that $\{X_t\}$ is integrated and has potentially a unit root with multiplicity $n$.

The term error correction was coined by Davidson et al. (1978). They interpret the mean of $\beta'X_t$, $\mu^* = \mathbb{E}\beta'X_t$, as the long-run equilibrium or steady state around which the system fluctuates. The deviation from equilibrium (error) is therefore given by $\beta'X_{t-1} - \mu^*$. The coefficients of the loading matrix $\alpha$ should then guarantee that deviations from the equilibrium are corrected over time by appropriate changes (corrections) in $X_t$.

An  Illustration 

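To illustrate the concept of the error correction model, we consider the following simple system with $\alpha = (\alpha_1, \alpha_2)'$, $\alpha_1 \ne \alpha_2$, and $\beta = (1, -1)'$. For simplicity, we assume that the long-run equilibrium $\mu^*$ is zero. Ignoring higher order lags, we consider the system:

$$\begin{aligned}
\Delta X_{1t} &= \alpha_1(X_{1,t-1} - X_{2,t-1}) + Z_{1t}\\
\Delta X_{2t} &= \alpha_2(X_{1,t-1} - X_{2,t-1}) + Z_{2t}.
\end{aligned}$$

The autoregressive polynomial of this system is:

$$\Phi(z) = I_2 - \begin{pmatrix}1+\alpha_1&-\alpha_1\\ \alpha_2&1-\alpha_2\end{pmatrix}z.$$

The determinant of this polynomial is $\det\Phi(z) = 1 - (2 + \alpha_1 - \alpha_2)z + (1 + \alpha_1 - \alpha_2)z^2$ with roots equal to $z = 1$ and $z = 1/(1 + \alpha_1 - \alpha_2)$. This shows that assumption (i) is fulfilled, provided that $|1 + \alpha_1 - \alpha_2| < 1$ so that the second root lies outside the unit circle. As

$$\Pi = \begin{pmatrix}\alpha_1&-\alpha_1\\ \alpha_2&-\alpha_2\end{pmatrix},$$

the rank of $\Pi$ equals one, which implies assumption (ii). Finally,

$$\Pi^2 = \begin{pmatrix}\alpha_1^2 - \alpha_1\alpha_2 & -\alpha_1^2 + \alpha_1\alpha_2\\ -\alpha_2^2 + \alpha_1\alpha_2 & \alpha_2^2 - \alpha_1\alpha_2\end{pmatrix}.$$

Thus, the rank of $\Pi^2$ is also one because $\alpha_1 \ne \alpha_2$. Hence, assumption (iii) is also fulfilled.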

We can gain an additional insight into the system by subtracting the second equation from the first one to obtain:

$$X_{1t} - X_{2t} = (1 + \alpha_1 - \alpha_2)(X_{1,t-1} - X_{2,t-1}) + Z_{1t} - Z_{2t}.$$

The process $\beta'X_t = X_{1t} - X_{2t}$ is stationary and causal with respect to $Z_{1t} - Z_{2t}$ if and only if $|1 + \alpha_1 - \alpha_2| < 1$, or equivalently if and only if $-2 < \alpha_1 - \alpha_2 < 0$. Note the importance of the assumption that $\alpha_1 \ne \alpha_2$. It prevents $X_{1t} - X_{2t}$ from becoming a random walk and thus a non-stationary (integrated) process. A sufficient condition is that $-1 < \alpha_1 < 0$ and $0 < \alpha_2 < 1$, which implies that a positive (negative) error, i.e. $X_{1,t-1} - X_{2,t-1} > 0$ $(< 0)$, is corrected by a negative (positive) change in $X_{1t}$ and a positive (negative) change in $X_{2t}$. Although the shocks $Z_{1t}$ and $Z_{2t}$ push $X_{1t} - X_{2t}$ time and again away from its long-run equilibrium, the error correction mechanism ensures that the variables are adjusted in such a way that the system moves back to its long-run equilibrium.
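The illustration can be traced numerically. The following sketch picks the hypothetical loadings $\alpha_1 = -0.2$ and $\alpha_2 = 0.3$, which satisfy the sufficient condition above, checks the roots and rank conditions, and simulates the system with standard normal shocks; the seed and sample length are arbitrary.

```python
import numpy as np

a1, a2 = -0.2, 0.3                          # hypothetical loadings, -1 < a1 < 0 < a2 < 1
alpha = np.array([[a1], [a2]])
coint = np.array([[1.0], [-1.0]])           # cointegration vector (beta in the text)
Pi = alpha @ coint.T
Phi1 = np.eye(2) + Pi                       # X_t = Phi1 X_{t-1} + Z_t

# det Phi(z) = 1 - (2 + a1 - a2) z + (1 + a1 - a2) z^2; highest power first for np.roots.
rho = 1.0 + a1 - a2                         # AR(1) coefficient of the error X_1t - X_2t
roots = np.sort(np.roots([rho, -(1.0 + rho), 1.0]).real)

rank_Pi = np.linalg.matrix_rank(Pi, tol=1e-10)
rank_Pi2 = np.linalg.matrix_rank(Pi @ Pi, tol=1e-10)

rng = np.random.default_rng(2)              # arbitrary seed
T = 1000
Z = rng.standard_normal((T, 2))
X = np.zeros((T, 2))
for t in range(1, T):
    X[t] = Phi1 @ X[t - 1] + Z[t]

error = X[:, 0] - X[:, 1]                   # beta' X_t, the deviation from equilibrium
print("roots:", roots, " rank(Pi):", rank_Pi, " rank(Pi^2):", rank_Pi2)
print("std of X_1t:", X[:, 0].std(), " std of the error:", error.std())
```

The levels drift like random walks while the error $X_{1t} - X_{2t}$ follows a stationary AR(1) with coefficient $1 + \alpha_1 - \alpha_2 = 0.5$.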


16.2.3  The  Beveridge-Nelson  Decomposition 


We next want to derive from the VAR representation a causal representation or MA(∞) representation for $\{\Delta X_t\}$. In contrast to a normal causal VAR model, the presence of unit roots precludes the simple application of the method of undetermined coefficients, but requires an additional effort. Multiplying the VAR representation in Eq. (16.3), $\Phi(L)X_t = U(L)M(L)V(L)X_t = c + Z_t$, from the left by $U^{-1}(L)$ we obtain:

$$M(L)V(L)X_t = U^{-1}(1)c + U^{-1}(L)Z_t.$$

Multiplying this equation by $\widetilde{M}(L) = \begin{pmatrix}I_{n-r}&0\\0&(1-L)I_r\end{pmatrix}$ leads to:

$$V(L)\Delta X_t = \widetilde{M}(1)U^{-1}(1)c + \widetilde{M}(L)U^{-1}(L)Z_t$$

which finally leads to

$$\Delta X_t = V^{-1}(1)\widetilde{M}(1)U^{-1}(1)c + V^{-1}(L)\widetilde{M}(L)U^{-1}(L)Z_t = \mu + \Psi(L)Z_t.$$

This is the MA(∞) representation of $\{\Delta X_t\}$ and corresponds to Eq. (16.2).

Because $\Pi = -\Phi(1) = -U(1)M(1)V(1)$, the following relation holds for the partitioned matrices:

$$\Phi(1) = \begin{pmatrix}U_{11}(1)&U_{12}(1)\\U_{21}(1)&U_{22}(1)\end{pmatrix}\begin{pmatrix}0&0\\0&I_r\end{pmatrix}\begin{pmatrix}V_{11}(1)&V_{12}(1)\\V_{21}(1)&V_{22}(1)\end{pmatrix} = \begin{pmatrix}U_{12}(1)\\U_{22}(1)\end{pmatrix}\begin{pmatrix}V_{21}(1)&V_{22}(1)\end{pmatrix}.$$

This implies that we can define $\alpha$ and $\beta$ as

$$\alpha = -\begin{pmatrix}U_{12}(1)\\U_{22}(1)\end{pmatrix} \quad\text{and}\quad \beta = \begin{pmatrix}V_{21}(1)&V_{22}(1)\end{pmatrix}'.$$

$U(1)$ and $V(1)$ are non-singular so that $\alpha$ and $\beta$ have full column rank $r$. Based on this derivation we can formulate the following lemma.


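Lemma 16.1. The columns of the so-defined matrix $\beta$ are the cointegration vectors for the process $\{X_t\}$. The corresponding matrix of loading coefficients is $\alpha$ which fulfills $\Psi(1)\alpha = 0$.


Proof. We must show that $\beta'\Psi(1) = 0$ which is the defining property of cointegration vectors. Denoting by $(V^{(ij)}(1))_{i,j=1,2}$ the appropriately partitioned matrix of $V(1)^{-1}$, we obtain:

$$\begin{aligned}
\beta'\Psi(1) &= \beta'\begin{pmatrix}V^{(11)}(1)&V^{(12)}(1)\\V^{(21)}(1)&V^{(22)}(1)\end{pmatrix}\begin{pmatrix}I_{n-r}&0\\0&0\end{pmatrix}U^{-1}(1)\\
&= \begin{pmatrix}V_{21}(1)&V_{22}(1)\end{pmatrix}\begin{pmatrix}V^{(11)}(1)&0\\V^{(21)}(1)&0\end{pmatrix}U^{-1}(1)\\
&= \begin{pmatrix}V_{21}(1)V^{(11)}(1) + V_{22}(1)V^{(21)}(1) & \vdots & 0\end{pmatrix}U^{-1}(1) = 0,
\end{aligned}$$

where the last equality is a consequence of the property of the inverse matrix.

With the same arguments, we can show that $\Psi(1)\alpha = 0$. □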


The equivalence between the VEC and the MA representation is known as Granger's representation theorem in the literature. Granger's representation theorem immediately implies the Beveridge-Nelson decomposition:

$$\begin{aligned}
X_t &= X_0 + \Psi(1)c\,t + \Psi(1)\sum_{j=1}^{t}Z_j + V_t &(16.6)\\
&= X_0 + V^{-1}(1)\widetilde{M}(1)U^{-1}(1)c\,t + V^{-1}(1)\widetilde{M}(1)U^{-1}(1)\sum_{j=1}^{t}Z_j + V_t &(16.7)
\end{aligned}$$

where the stochastic process $\{V_t\}$ is stationary and defined as $V_t = \widetilde{\Psi}(L)Z_0 - \widetilde{\Psi}(L)Z_t$ with $\widetilde{\Psi}_j = \sum_{i=j+1}^{\infty}\Psi_i$ and $\Psi(L) = V^{-1}(L)\widetilde{M}(L)U^{-1}(L)$. As $\beta'\Psi(1) = \beta'V^{-1}(1)\widetilde{M}(1)U^{-1}(1) = 0$, $\beta$ eliminates the stochastic trend (random walk), $\sum_{j=1}^{t}Z_j$, as well as the deterministic linear trend $\mu t = V^{-1}(1)\widetilde{M}(1)U^{-1}(1)c\,t$.

An interesting special case is obtained when the constant $c$ is a linear combination of the columns of $\alpha$, i.e. if there exists a vector $g$ such that $c = \alpha g$. Under this circumstance, $\Psi(1)c = \Psi(1)\alpha g = 0$ so that the linear trend vanishes and we have $\mathbb{E}\Delta X_t = 0$. In this case, the data will exhibit no trend although the VAR model contains a constant. A similar consideration can be made if the VAR model is specified to contain a constant and a linear time trend $d\,t$. The Beveridge-Nelson decomposition would then imply that the data should follow a quadratic trend. However, in the special case that $d$ is a linear combination of the columns of $\alpha$, the quadratic trend disappears and only the linear trend remains because of the constant.


16.2.4  Common  Trend  and  Triangular  Representation 

The matrix $\Psi(1)$ in the Beveridge-Nelson decomposition is singular. This implies that the multivariate random walk $\Psi(1)\sum_{j=1}^{t} Z_j$ does not consist of $n$ independent univariate random walks. Instead, only $n-r$ independent random walks make up the stochastic trend so that $\{X_t\}$ is driven by $n-r$ stochastic trends. In order to emphasize this fact, we derive from the Beveridge-Nelson decomposition the so-called common trend representation (Stock and Watson 1988a).

As $\Psi(1)$ has rank $n-r$, there exists an $n \times r$ matrix $\gamma$ such that $\Psi(1)\gamma = 0$. Denote by $\gamma_\perp$ the $n \times (n-r)$ matrix whose columns are orthogonal to $\gamma$, i.e. $\gamma'\gamma_\perp = 0$. The Beveridge-Nelson decomposition can then be rewritten as:

$$
\begin{aligned}
X_t &= X_0 + \Psi(1)\,(\gamma_\perp : \gamma)(\gamma_\perp : \gamma)^{-1} c\,t + \Psi(1)\,(\gamma_\perp : \gamma)(\gamma_\perp : \gamma)^{-1} \sum_{j=1}^{t} Z_j + V_t \\
&= X_0 + (\Psi(1)\gamma_\perp : 0)\,(\gamma_\perp : \gamma)^{-1} c\,t + (\Psi(1)\gamma_\perp : 0)\,(\gamma_\perp : \gamma)^{-1} \sum_{j=1}^{t} Z_j + V_t \\
&= X_0 + (\Psi(1)\gamma_\perp : 0)\,\tilde c\,t + (\Psi(1)\gamma_\perp : 0) \sum_{j=1}^{t} \tilde Z_j + V_t
\end{aligned}
$$

where $\tilde c = (\gamma_\perp : \gamma)^{-1} c$ and $\tilde Z_j = (\gamma_\perp : \gamma)^{-1} Z_j$. Therefore, only the first $n-r$ elements of the vector $\tilde c$ are relevant for the deterministic linear trend. The remaining elements are multiplied by zero and are thus irrelevant. Similarly, for the multivariate random walk only the first $n-r$ elements of the process $\{\tilde Z_t\}$ are responsible for the stochastic trend. The remaining elements of $\tilde Z_t$ are multiplied by zero and thus have no permanent, but only a transitory influence. The above representation decomposes the shocks orthogonally into permanent and transitory ones (Gonzalo and Ng 2001). The previous lemma shows that one can choose for $\gamma$ the matrix of loading coefficients $\alpha$.

Summarizing the first $n-r$ elements of $\tilde c$ and $\tilde Z_t$ to $\tilde c_1$ and $\tilde Z_{1t}$, respectively, we arrive at the common trend representation:

$$X_t = X_0 + B\,\tilde c_1\,t + B \sum_{j=1}^{t} \tilde Z_{1j} + V_t$$

where the $n \times (n-r)$ matrix $B$ is equal to $\Psi(1)\gamma_\perp$.
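The construction of $B$ can be checked numerically. The following is a minimal sketch, not taken from the text: the values of `alpha` and `beta` are hypothetical, and the long-run matrix is assumed to have the form $\Psi(1) = \beta_\perp(\alpha_\perp'\beta_\perp)^{-1}\alpha_\perp'$, which holds for a VAR(1) in error correction form under Granger's representation theorem.

```python
import numpy as np

def orth_complement(m):
    """Orthonormal basis of the orthogonal complement of span(m); m needs full column rank."""
    u, _, _ = np.linalg.svd(m, full_matrices=True)
    return u[:, m.shape[1]:]

# Hypothetical loading matrix alpha and cointegrating vector beta (n = 3, r = 1).
alpha = np.array([[-0.4], [0.1], [0.2]])
beta = np.array([[1.0], [-1.0], [0.5]])
alpha_perp = orth_complement(alpha)
beta_perp = orth_complement(beta)

# Assumed form of Psi(1) for a VAR(1): beta_perp (alpha_perp' beta_perp)^{-1} alpha_perp'.
psi1 = beta_perp @ np.linalg.inv(alpha_perp.T @ beta_perp) @ alpha_perp.T

gamma = alpha                          # gamma may be chosen as alpha
gamma_perp = orth_complement(gamma)
B = psi1 @ gamma_perp                  # n x (n - r) loadings of the common trends
stacked = psi1 @ np.hstack([gamma_perp, gamma])   # equals (B : 0)

print(np.round(stacked, 6))
```

The last $r$ columns of `stacked` are zero, so only the first $n-r$ transformed shocks enter the stochastic trend.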

Applying  these  results  to  our  introductory  example,  we  arrive  at 


B  = 


This again demonstrates that the trends, the linear as well as the stochastic one, stem exclusively from the nonstationary variables $\{Y_t\}$ (compare with Eq. (16.1)).

Finally, we want to present a triangular representation which is well suited to deal with the nonparametric estimation approach advocated by Phillips (1991) and Phillips and Hansen (1990) (see Sect. 16.4). In this representation we normalize the cointegration vector such that $\beta = (I_r, -b')'$. In addition, we partition the vector $X_t$ into $X_{1t}$ and $X_{2t}$ such that $X_{1t}$ contains the first $r$ and $X_{2t}$ the last $n-r$ elements. $X_t = (X_{1t}', X_{2t}')'$ can then be expressed as:

$$X_{1t} = b'X_{2t} + \pi_1 D_t + u_{1t} \qquad (16.8a)$$

$$\Delta X_{2t} = \pi_2 D_t + u_{2t} \qquad (16.8b)$$

where $D_t$ summarizes the deterministic components such as constant and linear time trend. $\{u_{1t}\}$ and $\{u_{2t}\}$ denote potentially autocorrelated and cross-correlated stationary time series.


16.3 Johansen's Test for Cointegration

In Sect. 7.5.2 we have already discussed a regression based test for cointegration among two variables. It was based on a unit root test of the residuals from a bivariate regression of one variable against the other. In this regression, it turned out to be irrelevant which of the two variables was chosen as the regressor and which one as the regressand. This method can, in principle, be extended to more than two variables. However, with more than two variables, the choice of the regressand becomes more crucial as not all variables may be part of the cointegrating relation. Moreover, more than one independent cointegrating relation may exist. For these reasons, it is advantageous to use a method which treats all variables symmetrically. The cointegration test developed by Johansen fulfills this criterion because it is based on a VAR which does not single out a particular variable. This test has received wide recognition and is most often used in practice. The test serves two purposes. First, we want to determine the number $r$ of cointegrating relationships. Second, we want to test properties of the cointegration vector $\beta$ and the loading matrix $\alpha$.

The exposition of the Johansen test closely follows the work of Johansen, where the derivations and additional details can be found (Johansen 1988, 1991, 1995). We start with a VAR(p) model with constant $c$ in VEC form (see Eq. (16.4)):

$$\Delta X_t = c + \Pi X_{t-1} + \Gamma_1 \Delta X_{t-1} + \dots + \Gamma_{p-1} \Delta X_{t-p+1} + Z_t, \qquad t = 1, 2, \dots, T \qquad (16.9)$$

where $Z_t \sim \mathrm{IIDN}(0, \Sigma)$ and given starting values $X_0 = x_0, \dots, X_{-p+1} = x_{-p+1}$.

The problem can be simplified by regressing $\Delta X_t$ as well as $X_{t-1}$ against $c, \Delta X_{t-1}, \dots, \Delta X_{t-p+1}$ and working with the residuals from these regressions.6 This simplification results in a VAR model of order one. We therefore start our analysis without loss of generality with a VAR(1) model without constant term:

$$\Delta X_t = \Pi X_{t-1} + Z_t$$

where $Z_t \sim \mathrm{IIDN}(0, \Sigma)$.7
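The partial regression step can be sketched as follows. This is an illustrative implementation under stated assumptions, not the author's code: the function name `concentrate_out` and the simulated random-walk data are hypothetical, and only a constant is treated as deterministic component.

```python
import numpy as np

def concentrate_out(X, p):
    """Residuals of dX_t and X_{t-1} after regressing each on a constant
    and the p-1 lagged differences dX_{t-1}, ..., dX_{t-p+1}."""
    dX = np.diff(X, axis=0)                  # dX[i] = X[i+1] - X[i]
    T = dX.shape[0] - (p - 1)                # effective sample size
    Y0 = dX[p - 1:]                          # dX_t
    Y1 = X[p - 1:-1]                         # X_{t-1}
    W = [np.ones((T, 1))]
    for j in range(1, p):
        W.append(dX[p - 1 - j:dX.shape[0] - j])   # dX_{t-j}
    W = np.hstack(W)
    R0 = Y0 - W @ np.linalg.lstsq(W, Y0, rcond=None)[0]
    R1 = Y1 - W @ np.linalg.lstsq(W, Y1, rcond=None)[0]
    return R0, R1

rng = np.random.default_rng(0)
X = np.cumsum(rng.standard_normal((200, 2)), axis=0)   # hypothetical I(1) data
R0, R1 = concentrate_out(X, p=2)
Pi_hat = np.linalg.lstsq(R1, R0, rcond=None)[0].T      # OLS estimate of Pi from the residuals
```

By the partial regression result, `Pi_hat` coincides with the coefficient on $X_{t-1}$ in the full regression of $\Delta X_t$ on the constant, the lagged differences, and $X_{t-1}$.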

The phenomenon of cointegration manifests itself in the singularity of the matrix $\Pi$. In particular, we want to determine the rank of $\Pi$ which gives the number of linearly independent cointegrating relationships. Denoting by $r$ the rank of $\Pi$, $0 \le r \le n$, we can formulate a sequence of hypotheses:

$$H(r): \operatorname{rank}(\Pi) \le r, \qquad r = 0, 1, \dots, n.$$

Hypothesis $H(r)$, thus, implies that there exist at most $r$ linearly independent cointegrating vectors. The sequence of hypotheses is nested in the sense that $H(r)$ implies $H(r+1)$:

$$H(0) \subseteq H(1) \subseteq \dots \subseteq H(n).$$

The hypothesis $H(0)$ means that $\operatorname{rank}(\Pi) = 0$. In this case, $\Pi = 0$ and there are no cointegration vectors. $\{X_t\}$ is thus driven by $n$ independent random walks and


6 If the VAR model (16.9) contains further deterministic components besides the constant, these components have to be accounted for in these regressions.

7 This two-stage least-squares procedure is also known as partial regression and is part of the Frisch-Waugh-Lovell Theorem (Davidson and MacKinnon 1993; 19-24).

the VAR can be transformed into a VAR model for $\{\Delta X_t\}$ which in our simplified version just means that $\Delta X_t = Z_t \sim \mathrm{IIDN}(0, \Sigma)$. The hypothesis $H(n)$ places no restriction on $\Pi$ and includes in this way the case that the level of $\{X_t\}$ is already stationary. Of particular interest are the hypotheses between these two extreme ones where non-degenerate cointegrating vectors are possible. In the following, we not only want to test for the number of linearly independent cointegrating vectors, $r$, but we also want to test hypotheses about the structure of the cointegrating vectors summarized in $\beta$.

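Johansen’s test is conceived as a likelihood-ratio test. This means that we must determine the likelihood function for a sample $X_1, X_2, \dots, X_T$ where $T$ denotes the sample size. For this purpose, we assume that $\{Z_t\} \sim \mathrm{IIDN}(0, \Sigma)$ so that the log-likelihood function of the parameters $\alpha$, $\beta$, and $\Sigma$ conditional on the starting values is given by:

$$\ell(\alpha, \beta, \Sigma) = -\frac{Tn}{2} \ln(2\pi) + \frac{T}{2} \ln\det(\Sigma^{-1}) - \frac{1}{2} \sum_{t=1}^{T} (\Delta X_t - \alpha\beta'X_{t-1})' \Sigma^{-1} (\Delta X_t - \alpha\beta'X_{t-1})$$

where $\Pi = \alpha\beta'$. For a fixed given $\beta$, $\alpha$ can be estimated by a regression of $\Delta X_t$ on $\beta'X_{t-1}$:

$$\hat\alpha = \hat\alpha(\beta) = S_{01}\beta(\beta'S_{11}\beta)^{-1}$$

where the moment matrices $S_{00}$, $S_{11}$, $S_{01}$ and $S_{10}$ are defined as:

$$S_{00} = \frac{1}{T} \sum_{t=1}^{T} (\Delta X_t)(\Delta X_t)'$$

$$S_{11} = \frac{1}{T} \sum_{t=1}^{T} X_{t-1}X_{t-1}'$$

$$S_{01} = \frac{1}{T} \sum_{t=1}^{T} (\Delta X_t)X_{t-1}'$$

$$S_{10} = S_{01}'.$$

The covariance matrix of the residuals then becomes:

$$\hat\Sigma = \hat\Sigma(\beta) = S_{00} - S_{01}\beta(\beta'S_{11}\beta)^{-1}\beta'S_{10}.$$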


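Using these results, we can concentrate the log-likelihood function to obtain:

$$
\begin{aligned}
\ell(\beta) = \ell(\hat\alpha(\beta), \beta, \hat\Sigma(\beta)) &= -\frac{Tn}{2} \ln(2\pi) - \frac{T}{2} \ln\det(\hat\Sigma(\beta)) - \frac{Tn}{2} \\
&= -\frac{Tn}{2} \ln(2\pi) - \frac{Tn}{2} - \frac{T}{2} \ln\det\left(S_{00} - S_{01}\beta(\beta'S_{11}\beta)^{-1}\beta'S_{10}\right). \qquad (16.10)
\end{aligned}
$$

The expression $\frac{Tn}{2}$ in the above equation is derived as follows:

$$
\begin{aligned}
\frac{1}{2} \sum_{t=1}^{T} (\Delta X_t - \hat\alpha\beta'X_{t-1})' \hat\Sigma^{-1} (\Delta X_t - \hat\alpha\beta'X_{t-1})
&= \frac{1}{2} \operatorname{tr}\left( \sum_{t=1}^{T} (\Delta X_t - \hat\alpha\beta'X_{t-1})(\Delta X_t - \hat\alpha\beta'X_{t-1})' \hat\Sigma^{-1} \right) \\
&= \frac{1}{2} \operatorname{tr}\left( (T S_{00} - T\hat\alpha\beta'S_{10} - T S_{01}\beta\hat\alpha' + T\hat\alpha\beta'S_{11}\beta\hat\alpha') \hat\Sigma^{-1} \right) \\
&= \frac{T}{2} \operatorname{tr}\Big( \underbrace{(S_{00} - \hat\alpha\beta'S_{10})}_{=\hat\Sigma} \hat\Sigma^{-1} \Big) = \frac{Tn}{2}.
\end{aligned}
$$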

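The log-likelihood function is thus maximized if

$$\det(\hat\Sigma(\beta)) = \det\left(S_{00} - S_{01}\beta(\beta'S_{11}\beta)^{-1}\beta'S_{10}\right) = \det S_{00}\, \frac{\det\left(\beta'(S_{11} - S_{10}S_{00}^{-1}S_{01})\beta\right)}{\det(\beta'S_{11}\beta)}$$

is minimized over $\beta$.8 The minimum is obtained by solving the following generalized eigenvalue problem (Johansen 1995):

$$\det(\lambda S_{11} - S_{10}S_{00}^{-1}S_{01}) = 0.$$

This eigenvalue problem delivers $n$ eigenvalues

$$1 \ge \hat\lambda_1 \ge \hat\lambda_2 \ge \dots \ge \hat\lambda_n \ge 0$$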


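8 Thereby we make use of the following equality for partitioned matrices:

$$\det\begin{pmatrix} A_{11} & A_{12} \\ A_{21} & A_{22} \end{pmatrix} = \det A_{11} \det(A_{22} - A_{21}A_{11}^{-1}A_{12}) = \det A_{22} \det(A_{11} - A_{12}A_{22}^{-1}A_{21})$$

where $A_{11}$ and $A_{22}$ are invertible matrices (see for example Meyer 2000; p. 475).

with corresponding $n$ eigenvectors $\hat\beta_1, \dots, \hat\beta_n$. These eigenvectors are normalized such that $\hat\beta'S_{11}\hat\beta = I_n$. Choosing for $\beta$ the $r$ eigenvectors belonging to the $r$ largest eigenvalues, we therefore have that

$$\min_{\beta} \det(\hat\Sigma(\beta)) = (\det S_{00}) \prod_{i=1}^{r} (1 - \hat\lambda_i).$$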
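The generalized eigenvalue problem can be solved with standard linear algebra routines. The sketch below is illustrative only: the function name `johansen_eigen` and the simulated bivariate VAR(1) with its parameter values are assumptions. It reduces the problem to a symmetric one through the Cholesky factor of $S_{11}$, which delivers the normalization $\hat\beta'S_{11}\hat\beta = I_n$ automatically, and then checks that $\det\hat\Sigma(\beta) = \det S_{00}\,(1 - \hat\lambda_1)$ when $\beta$ is the first eigenvector.

```python
import numpy as np

def johansen_eigen(R0, R1):
    """Solve det(lam*S11 - S10 S00^{-1} S01) = 0 for R0 (rows dX_t') and R1 (rows X_{t-1}')."""
    T = R0.shape[0]
    S00, S01, S11 = R0.T @ R0 / T, R0.T @ R1 / T, R1.T @ R1 / T
    L = np.linalg.cholesky(S11)                 # S11 = L L'
    A = np.linalg.solve(L, S01.T)               # L^{-1} S10
    M = A @ np.linalg.solve(S00, A.T)           # L^{-1} S10 S00^{-1} S01 L^{-T}
    lam, V = np.linalg.eigh((M + M.T) / 2)
    order = np.argsort(lam)[::-1]               # largest eigenvalue first
    lam, V = lam[order], V[:, order]
    beta = np.linalg.solve(L.T, V)              # normalized so that beta' S11 beta = I
    return lam, beta, S00, S01, S11

# Simulated bivariate VAR(1) with one cointegrating relation (hypothetical parameters).
rng = np.random.default_rng(42)
alpha, beta0 = np.array([[-0.5], [0.25]]), np.array([[1.0], [-1.0]])
X = np.zeros((501, 2))
for t in range(1, 501):
    X[t] = X[t - 1] + (alpha @ beta0.T) @ X[t - 1] + rng.standard_normal(2)

lam, beta, S00, S01, S11 = johansen_eigen(np.diff(X, axis=0), X[:-1])

# Check of the determinant identity for r = 1.
b = beta[:, :1]
Sigma_b = S00 - S01 @ b @ np.linalg.inv(b.T @ S11 @ b) @ b.T @ S01.T
print(lam, np.linalg.det(Sigma_b), np.linalg.det(S00) * (1 - lam[0]))
```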

Remark 16.1. In the case of cointegration, $\Pi$ is singular with $\operatorname{rank}\Pi = r$. To estimate $r$, it seems natural to investigate the number of nonzero eigenvalues of $\hat\Pi = S_{01}S_{11}^{-1}$. However, because eigenvalues may be complex, it is advantageous not to investigate the eigenvalues of $\hat\Pi$ but those of $\hat\Pi'\hat\Pi$ which are all real and nonnegative due to the symmetry of $\hat\Pi'\hat\Pi$. The square roots of these eigenvalues are called the singular values of $\hat\Pi$.9 Noting that

$$
\begin{aligned}
0 = \det(\lambda S_{11} - S_{10}S_{00}^{-1}S_{01}) &= \det S_{11} \det\left(\lambda I_n - S_{11}^{-1/2}S_{10}S_{00}^{-1}S_{01}S_{11}^{-1/2}\right) \\
&= \det S_{11} \det\left(\lambda I_n - (S_{00}^{-1/2}S_{01}S_{11}^{-1/2})'(S_{00}^{-1/2}S_{01}S_{11}^{-1/2})\right),
\end{aligned}
$$

the generalized eigenvalue problem above therefore just determines the squared singular values of $S_{00}^{-1/2}S_{01}S_{11}^{-1/2} = S_{00}^{-1/2}\hat\Pi S_{11}^{1/2}$.

Remark 16.2. Based on the observation that, for $n = 1$, $\hat\lambda = \frac{S_{01}S_{10}}{S_{11}S_{00}}$ equals the squared empirical correlation coefficient between $\Delta X_t$ and $X_{t-1}$, we find that the eigenvalues $\hat\lambda_j$, $j = 1, \dots, n$, are nothing but the squared canonical correlation coefficients (see Johansen 1995; Reinsel 1993). Thereby, the largest eigenvalue, $\hat\lambda_1$, corresponds to the largest squared correlation coefficient that can be achieved between linear combinations of $\Delta X_t$ and $X_{t-1}$. Thus $\hat\beta_1$ gives the linear combination of the integrated variable $X_{t-1}$ which comes closest in the sense of correlation to the stationary variable $\{\Delta X_t\}$. The second eigenvalue $\hat\lambda_2$ corresponds to the maximal squared correlation coefficient between linear combinations of $\Delta X_t$ and $X_{t-1}$ which are orthogonal to the linear combination corresponding to $\hat\lambda_1$. The remaining squared canonical correlation coefficients are obtained by iterating this procedure $n$ times.
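Both remarks can be verified numerically. The sketch below uses hypothetical simulated series as stand-ins for $\Delta X_t$ and $X_{t-1}$; it compares the eigenvalues of the generalized eigenvalue problem with the squared singular values of $S_{00}^{-1/2}S_{01}S_{11}^{-1/2}$ and checks the scalar case against the ordinary squared correlation coefficient (computed from demeaned series, an assumption made here so that moments and correlations coincide).

```python
import numpy as np

def inv_sqrt(S):
    """Symmetric inverse square root of a positive definite matrix."""
    w, Q = np.linalg.eigh(S)
    return Q @ np.diag(w ** -0.5) @ Q.T

rng = np.random.default_rng(1)
# Hypothetical stand-ins for the (concentrated) series X_{t-1} and dX_t.
R1 = np.cumsum(rng.standard_normal((300, 3)), axis=0)
R0 = -0.05 * R1 + rng.standard_normal((300, 3))
T = R0.shape[0]
S00, S01, S11 = R0.T @ R0 / T, R0.T @ R1 / T, R1.T @ R1 / T

# Eigenvalues of S11^{-1} S10 S00^{-1} S01 solve det(lam*S11 - S10 S00^{-1} S01) = 0.
G = np.linalg.solve(S11, S01.T @ np.linalg.solve(S00, S01))
lam = np.sort(np.linalg.eigvals(G).real)[::-1]

# Singular values of S00^{-1/2} S01 S11^{-1/2}: the sample canonical correlations.
sv = np.linalg.svd(inv_sqrt(S00) @ S01 @ inv_sqrt(S11), compute_uv=False)

# Scalar case: lam equals the squared correlation of the demeaned series.
x = R1[:, 0] - R1[:, 0].mean()
dx = R0[:, 0] - R0[:, 0].mean()
lam_scalar = (dx @ x) ** 2 / ((x @ x) * (dx @ dx))
```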

If the dimension of the cointegrating space is $r$ then $\hat\beta$ consists of those eigenvectors which correspond to the $r$ largest eigenvalues $\hat\lambda_1, \dots, \hat\lambda_r$. The remaining eigenvalues $\lambda_{r+1}, \dots, \lambda_n$ should be zero. Under the null hypothesis $H(r)$, the log-likelihood function (16.10) can be finally expressed as:


9 An appraisal of the singular values of a matrix can be found in Strang (1988) or Meyer (2000).

$$\ell(\hat\beta_r) = -\frac{Tn}{2} \ln(2\pi) - \frac{Tn}{2} - \frac{T}{2} \ln\det S_{00} - \frac{T}{2} \sum_{i=1}^{r} \ln(1 - \hat\lambda_i).$$

The  expression  for  the  optimized  likelihood  function  can  now  be  used  to 
construct  the  Johansen  likelihood-ratio  test.  There  are  two  versions  of  the  test 
depending  on  the  alternative  hypothesis: 

trace test: $H_0: H(r)$ against $H(n)$,
max test: $H_0: H(r)$ against $H(r+1)$.

The corresponding likelihood ratio test statistics are therefore:

trace test: $2(\ell(\hat\beta_n) - \ell(\hat\beta_r)) = -T \sum_{j=r+1}^{n} \ln(1 - \hat\lambda_j) \approx T \sum_{j=r+1}^{n} \hat\lambda_j$,

max test: $2(\ell(\hat\beta_{r+1}) - \ell(\hat\beta_r)) = -T \ln(1 - \hat\lambda_{r+1}) \approx T\hat\lambda_{r+1}$.
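Given the ordered eigenvalues, both statistics follow directly. The snippet is a minimal sketch with hypothetical eigenvalues and sample size; the helper name `johansen_statistics` is not from the text. It also shows that the trace statistic for $H(r)$ is the sum of the max statistics for $H(r), \dots, H(n-1)$.

```python
import numpy as np

def johansen_statistics(eigenvalues, T):
    """Trace and max statistics for H(0), ..., H(n-1) from the ordered eigenvalues."""
    lam = np.sort(np.asarray(eigenvalues, dtype=float))[::-1]
    max_stat = -T * np.log1p(-lam)                  # H(r) against H(r+1)
    trace_stat = np.cumsum(max_stat[::-1])[::-1]    # H(r) against H(n)
    return trace_stat, max_stat

lam = [0.20, 0.10, 0.05]          # hypothetical eigenvalues
trace_stat, max_stat = johansen_statistics(lam, T=200)
# First-order approximation T * sum of the remaining eigenvalues.
approx_trace = 200 * np.cumsum(np.array(lam)[::-1])[::-1]
```

Because $-\ln(1-x) \ge x$ for $0 \le x < 1$, the approximation `approx_trace` never exceeds the exact statistic.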

In practice it is useful to adopt a sequential test strategy based on the trace test. Given some significance level, we test in a first step the null hypothesis $H(0)$ against $H(n)$. If, on the one hand, the test does not reject the null hypothesis, we conclude that $r = 0$ and that there is no cointegrating relation. If, on the other hand, the test rejects the null hypothesis, we conclude that there is at least one cointegrating relation. We then test in a second step the null hypothesis $H(1)$ against $H(n)$. If the test does not reject the null hypothesis, we conclude that there exists one cointegrating relation, i.e. that $r = 1$. If the test rejects the null hypothesis, we examine the next hypothesis $H(2)$, and so on. In this way we obtain a test sequence. If in this sequence the null hypotheses $H(0), \dots, H(r-1)$ are rejected, but $H(r)$ is not, we conclude that there exist $r$ linearly independent cointegrating relations as explained in the diagram below.

                     rejection                        rejection
H(0) against H(n)  ----------->  H(1) against H(n)  ----------->  H(2) against H(n)  ...
        |                                |                                |
      r = 0                            r = 1                            r = 2
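The sequential strategy in the diagram amounts to a short loop. In the sketch below the function name `select_rank` is hypothetical; the numbers are the trace statistics and 5 % critical values reported in Table 16.2 of the example in Sect. 16.5.

```python
def select_rank(trace_stats, critical_values):
    """Return the first r for which H(r) is not rejected (n if every H(r) is rejected)."""
    for r, (stat, crit) in enumerate(zip(trace_stats, critical_values)):
        if stat <= crit:
            return r
    return len(trace_stats)

trace = [111.772, 64.578, 20.503, 1.520]    # trace statistics for H(0), ..., H(3)
crit = [47.856, 29.797, 15.495, 3.841]      # corresponding 5 % critical values
rank = select_rank(trace, crit)
print(rank)
```

With these inputs the loop rejects $H(0)$, $H(1)$ and $H(2)$ and stops at $H(3)$, in line with the conclusion drawn in the example.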

If in this sequence we do not reject $H(r)$ for some $r$, it is useful to perform the max test $H(r)$ against $H(r+1)$ as a robustness check. The asymptotic distributions of the test statistics are, like in the Dickey-Fuller unit root test, nonstandard and depend on the specification of the deterministic components.


16.3.1 Specification of the Deterministic Components

As mentioned previously, the asymptotic distribution of Johansen’s test depends on the specification of the deterministic components. Thus, some care must be devoted to this issue. We illustrate this point by decomposing the model additively into a linear deterministic and a stochastic component in vector error correction form (see Johansen (1995; 80-84) and Lütkepohl (2006; section 6.4)):

$$X_t = \mu_0 + \mu_1 t + Y_t \qquad (16.11)$$

$$\Delta Y_t = \Pi Y_{t-1} + Z_t = \alpha\beta'Y_{t-1} + Z_t. \qquad (16.12)$$

For ease of exposition, we have omitted the autoregressive corrections. Eliminating $Y_t$ using $Y_t = X_t - \mu_0 - \mu_1 t$ and $\Delta Y_t = \Delta X_t - \mu_1$ leads to

$$\Delta X_t - \mu_1 = \alpha\beta'(X_{t-1} - \mu_0 - \mu_1(t-1)) + Z_t.$$

This equation can be rewritten as

$$\Delta X_t = c_0 + c_1(t-1) + \alpha\beta'X_{t-1} + Z_t \qquad (16.13)$$

with $c_0 = \mu_1 - \alpha\beta'\mu_0$ and $c_1 = -\alpha\beta'\mu_1$, or equivalently as

$$\Delta X_t = c_0 + \alpha(\beta', -\beta'\mu_1)X^0_{t-1} + Z_t \qquad (16.14)$$

where $X^0_t = (X_t', t)'$. Equation (16.13) is just the vector error correction model (16.4) augmented by the linear trend term $c_1 t$. If the term $c_1$ were left unrestricted, arguments similar to those in Sect. 16.2.3 would show that $X_t$ exhibits a deterministic quadratic trend with coefficient vector $\Psi(1)c_1$. This, however, contradicts the specification in Eq. (16.11). If we recognize instead that $c_1$ in Eq. (16.13) is actually restricted to lie in the span of $\alpha$, i.e. that $c_1 = \alpha\gamma_1$ with $\gamma_1 = -\beta'\mu_1$, no quadratic trend emerges in the levels because $\Psi(1)\alpha = 0$ by Granger’s representation Theorem 16.1. Alternatively, one may view the time trend as showing up in the error correction term, respectively being part of the cointegrating relation, as in Eq. (16.14).

Similarly, one may consider the case that $X_t$ has a constant mean $\mu_0$, i.e. that $\mu_1 = 0$ in Eq. (16.11). This leads to the same error correction specification (16.13), but without the term $c_1 t$. Leaving the constant $c_0$ unrestricted, this will generate a linear trend $\Psi(1)c_0 t$ as shown in Sect. 16.2.3. In order to reconcile this with the assumption of a constant mean, we must recognize that $c_0 = \alpha\gamma_0$ with $\gamma_0 = -\beta'\mu_0$.

Table 16.1 Trend specifications in vector error correction models

Case   Deterministic term    Restriction               Trend in $X_t$   $E\Delta X_t$   $E(\beta'X_t)$
       in VECM or VAR
I      None                  -                         Zero             Zero            Zero
II     $c_0$                 $c_0 = \alpha\gamma_0$    Constant         Zero            Constant
III    $c_0$                 None                      Linear           Constant        Constant
IV     $c_0 + c_1 t$         $c_1 = \alpha\gamma_1$    Linear           Constant        Linear
V      $c_0 + c_1 t$         None                      Quadratic        Linear          Linear

Table inspired by Johansen (2007)
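The difference between Cases II and III can be illustrated by iterating the error correction model without noise. The sketch below is an illustration under assumptions: the values of `alpha`, `beta` and the two constants are hypothetical, and the long-run matrix is taken to be $\Psi(1) = I - \alpha(\beta'\alpha)^{-1}\beta'$, which is the form it takes for a VAR(1).

```python
import numpy as np

alpha = np.array([[-0.5], [0.25]])   # hypothetical loadings
beta = np.array([[1.0], [-1.0]])     # hypothetical cointegrating vector
Pi = alpha @ beta.T

def drift_after(c0, steps=200):
    """Iterate the noise-free VECM dX_t = c0 + Pi X_{t-1} and return the final increment."""
    x = np.zeros(2)
    for _ in range(steps):
        dx = c0 + Pi @ x
        x = x + dx
    return dx

c0_restricted = (alpha * 2.0).ravel()      # Case II: c0 = alpha * gamma0 with gamma0 = 2
c0_free = np.array([0.3, 0.6])             # Case III: c0 not in the span of alpha

# Assumed long-run matrix of this VAR(1): Psi(1) = I - alpha (beta' alpha)^{-1} beta'.
psi1 = np.eye(2) - alpha @ np.linalg.inv(beta.T @ alpha) @ beta.T

print(drift_after(c0_restricted), drift_after(c0_free), psi1 @ c0_free)
```

With the restricted constant the increments die out, so the level has no trend; with the unrestricted constant the increments converge to $\Psi(1)c_0$, which is the slope of the linear trend.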


Based on these arguments, we can summarize the discussion by distinguishing five different cases displayed in Table 16.1.10 This table also shows the implications for $E\Delta X_t$ and $E(\beta'X_t)$. These can be read off from Eqs. (16.13) and (16.14).

The corresponding asymptotic distributions of the trace as well as the max test statistic in these five cases are tabulated in Johansen (1995), MacKinnon et al. (1999), and Osterwald-Lenum (1992).11 The finite sample properties of these tests can be quite poor. Thus, more recently, bootstrap methods have proven to provide a successful alternative in practice (Cavaliere et al. 2012).


16.3.2  Testing  Hypotheses  on  Cointegrating  Vectors 

As mentioned previously, the cointegrating vectors are not unique, only the cointegrating space is. This makes the cointegrating vectors often difficult to interpret economically, despite some basis transformation. It is therefore of interest to see whether the space spanned by the cointegrating vectors summarized in the columns of $\beta$ can be viewed as a subspace of the space spanned by some hypothetical vectors $H = (h_1, \dots, h_s)$, $r \le s < n$. If this hypothesis is true, the cointegrating vectors should be linear combinations of the columns of $H$ so that the null hypothesis can be formulated as

$$H_0: \beta = H\varphi \qquad (16.15)$$

for some $s \times r$ matrix $\varphi$. Under this null hypothesis, maximizing the likelihood amounts to solving an analogous generalized eigenvalue problem:

$$\det(\lambda H'S_{11}H - H'S_{10}S_{00}^{-1}S_{01}H) = 0.$$

The solution of this problem is given by the eigenvalues $1 > \tilde\lambda_1 \ge \tilde\lambda_2 \ge \dots \ge \tilde\lambda_s > 0$ with corresponding normalized eigenvectors. The likelihood ratio test statistic for this hypothesis is then


10 It is instructive to compare these cases to those of the unit root test (see Sect. 7.3.1).

11 The tables by MacKinnon et al. (1999) allow for the possibility of exogenous integrated variables.

$$T \sum_{j=1}^{r} \ln\frac{1 - \tilde\lambda_j}{1 - \hat\lambda_j}.$$

This test statistic is asymptotically distributed as a $\chi^2$ distribution with $r(n-s)$ degrees of freedom.
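The restricted and unrestricted eigenvalue problems differ only in the projection onto the columns of $H$, so the test statistic is easy to compute once the moment matrices are available. The following sketch rests on assumptions: the function names, the simulated trivariate VAR(1) and the choice of $H$ are hypothetical, and the statistic is coded as $T\sum_{j=1}^{r}\ln\{(1-\tilde\lambda_j)/(1-\hat\lambda_j)\}$.

```python
import numpy as np

def gen_eigenvalues(S00, S01, S11, H=None):
    """Eigenvalues of det(lam*H'S11 H - H'S10 S00^{-1} S01 H) = 0 in descending order."""
    H = np.eye(S11.shape[0]) if H is None else H
    A = H.T @ S11 @ H
    B = H.T @ S01.T @ np.linalg.solve(S00, S01 @ H)
    return np.sort(np.linalg.eigvals(np.linalg.solve(A, B)).real)[::-1]

def lr_beta_in_span(S00, S01, S11, H, r, T):
    """LR statistic for H0: beta = H phi given cointegrating rank r."""
    lam_hat = gen_eigenvalues(S00, S01, S11)[:r]
    lam_tilde = gen_eigenvalues(S00, S01, S11, H)[:r]
    return T * np.sum(np.log((1.0 - lam_tilde) / (1.0 - lam_hat)))

# Simulated trivariate VAR(1) with r = 1 (hypothetical parameters).
rng = np.random.default_rng(7)
alpha = np.array([[-0.4], [0.1], [0.0]])
beta = np.array([[1.0], [-1.0], [0.0]])
X = np.zeros((401, 3))
for t in range(1, 401):
    X[t] = X[t - 1] + (alpha @ beta.T) @ X[t - 1] + rng.standard_normal(3)
R0, R1 = np.diff(X, axis=0), X[:-1]
T = R0.shape[0]
S00, S01, S11 = R0.T @ R0 / T, R0.T @ R1 / T, R1.T @ R1 / T

H = np.array([[1.0, 0.0], [-1.0, 0.0], [0.0, 1.0]])   # s = 2 columns, span contains beta
stat = lr_beta_in_span(S00, S01, S11, H, r=1, T=T)
dof = 1 * (3 - 2)                                      # r(n - s)
```

The statistic is nonnegative by construction because restricting $\beta$ to the span of $H$ cannot increase the largest eigenvalues; it would be compared with a $\chi^2$ critical value with `dof` degrees of freedom.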

With similar arguments it is possible to construct a test of the null hypothesis that the cointegrating space spanned by the columns of $\beta$ contains some hypothetical vectors $K = (h_1, \dots, h_s)$, $1 \le s \le r$. The null hypothesis can then be formulated as

$$H_0: K\varphi = \beta \qquad (16.16)$$

for some $s \times r$ matrix $\varphi$. Like in the previous case, this hypothesis can also be tested by the corresponding likelihood ratio test statistic which is asymptotically distributed as a $\chi^2$ distribution with $s(n-r)$ degrees of freedom. Similarly, it is possible to test hypotheses on $\alpha$ and joint hypotheses on $\alpha$ and $\beta$ (see Johansen 1995; Kunst and Neusser 1990; Lütkepohl 2006).


16.4 Estimation and Testing of Cointegrating Relationships

Johansen’s approach has become very popular because it presents an integrated framework for testing and estimating cointegrating relationships based on the maximum likelihood method. However, it requires the specification of a concrete VAR model. This sometimes proves difficult in practice, especially when the true data generating process is not purely autoregressive. Similar to the Phillips-Perron test discussed in Sect. 7.3.2, Phillips and Hansen (1990) propose a nonparametric approach for the estimation and hypothesis testing of cointegrating relationships. This approach is especially appropriate if the long-run relationships are the prime objective of the investigation as, for example, in Neusser and Kugler (1998).

The Phillips and Hansen approach is based on the triangular representation of cointegrated processes given in the equation system (16.8). Thereby the $r$ cointegration vectors are normalized such that $\beta = (I_r, -b')'$ where $b$ is the regression coefficient matrix from a regression of $X_{1t}$ on $X_{2t}$ controlling for deterministic components $D_t$ (see Eq. (16.8a)).12 The least-squares estimate of $b$ is (super) consistent as already noted in Sect. 7.5.2. However, the estimator is not directly suitable for hypothesis testing because the conventional test statistics do not have the usual asymptotic distributions. The idea of Phillips and Hansen (1990) is to correct the conventional least-squares estimates to account for serial correlation and for the endogeneity arising from the cointegrating relationship. This leads to the fully-modified ordinary least-squares estimator (FMOLS estimator).


12 The choice of the variables used for normalization turns out to be important in practice. See the application in Sect. 16.5.

As the endogeneity shows up in the long-run correlation between the variables, the proposed modification makes use of the long-run variance $J$ of $u_t = (u_{1t}', u_{2t}')'$. According to Sect. 11.1 this entity is defined as:

$$J = \sum_{h=-\infty}^{\infty} \Gamma(h) = \Lambda + \Lambda' - \Sigma$$

where

$$\Lambda = \sum_{h=0}^{\infty} \Gamma(h), \qquad \Sigma = \begin{pmatrix} \Sigma_{11} & \Sigma_{12} \\ \Sigma_{21} & \Sigma_{22} \end{pmatrix},$$

and $J$ and $\Lambda$ are partitioned conformably with $u_t$.

The fully-modified ordinary least-squares estimator of $(b, \pi_1)$ is then constructed as follows. Estimate the Eqs. (16.8a) and (16.8b) by ordinary least-squares to compute the residuals $\hat u_t = (\hat u_{1t}', \hat u_{2t}')'$. From these residuals estimate $\Sigma$ as $\hat\Sigma = \frac{1}{T}\sum_{t=1}^{T} \hat u_t \hat u_t'$ and the long-run variance $J$ and its one-sided counterpart $\Lambda$. Estimates of $J$ and $\Lambda$, denoted by $\hat J$ and $\hat\Lambda$, can be obtained by applying a kernel estimator as explained in Sect. 4.4, respectively Sect. 11.1. The estimators $\hat\Sigma$, $\hat J$ and $\hat\Lambda$ are consistent because ordinary least-squares is. The corresponding estimates are then used to correct the data for $X_{1t}$ and to construct the bias correction term $\hat\Lambda_{21}^{+}$:

$$X_{1t}^{+} = X_{1t} - \hat J_{12}\hat J_{22}^{-1}\hat u_{2t}$$

$$\hat\Lambda_{21}^{+} = \hat\Lambda_{21} - \hat\Lambda_{22}\hat J_{22}^{-1}\hat J_{21}.$$


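The fully-modified ordinary least-squares estimator (FMOLS estimator) is then given by

$$(\hat b', \hat\pi_1) = \left( \sum_{t=1}^{T} X_{1t}^{+}(X_{2t}', D_t') - T\,(\hat\Lambda_{21}^{+\prime}, 0) \right) \left( \sum_{t=1}^{T} (X_{2t}', D_t')'(X_{2t}', D_t') \right)^{-1}.$$

It turns out that this estimator is asymptotically equivalent to full maximum likelihood with limiting distributions free of nuisance parameters.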
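The ingredients of the correction can be sketched in a few lines. The code below is an illustration under assumptions, not the procedure used in the text: it uses a Bartlett kernel with a fixed bandwidth (the application in Sect. 16.5 uses a quadratic spectral kernel with prewhitening), the function names are hypothetical, the residuals are simulated, and the corrections are coded as $X_{1t}^{+} = X_{1t} - \hat J_{12}\hat J_{22}^{-1}\hat u_{2t}$ and $\hat\Lambda_{21}^{+} = \hat\Lambda_{21} - \hat\Lambda_{22}\hat J_{22}^{-1}\hat J_{21}$.

```python
import numpy as np

def long_run_variance(u, bandwidth):
    """Bartlett-kernel estimates of Sigma, the one-sided sum Lambda, and J."""
    u = u - u.mean(axis=0)
    T = u.shape[0]
    Sigma = u.T @ u / T
    Lam = Sigma.copy()
    for h in range(1, bandwidth + 1):
        weight = 1.0 - h / (bandwidth + 1.0)
        Lam += weight * (u[h:].T @ u[:-h]) / T     # estimate of E(u_t u_{t-h}')
    return Sigma, Lam, Lam + Lam.T - Sigma

def fm_corrections(X1, U2, Lam, J, r):
    """Endogeneity-corrected X1 and the bias term Lambda_21^+ (first r variables form block 1)."""
    J21, J22 = J[r:, :r], J[r:, r:]
    K = np.linalg.solve(J22, J21)                  # J22^{-1} J21
    X1_plus = X1 - U2 @ K                          # rows: X_1t' - u_2t' J22^{-1} J21
    Lam21_plus = Lam[r:, :r] - Lam[r:, r:] @ K
    return X1_plus, Lam21_plus

rng = np.random.default_rng(3)
e = rng.standard_normal((301, 3))
u = e[1:] + 0.5 * e[:-1]                           # hypothetical MA(1) residuals, r = 1
Sigma, Lam, J = long_run_variance(u, bandwidth=4)
X1 = rng.standard_normal((300, 1))                 # placeholder for the X_1t data
X1_plus, Lam21_plus = fm_corrections(X1, u[:, 1:], Lam, J, r=1)
```

Because the same kernel weights enter $\hat J$ and $\hat\Lambda$, the identity $\hat J = \hat\Lambda + \hat\Lambda' - \hat\Sigma$ holds exactly in the sample.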

The main advantage of the FMOLS estimator is that conventional Wald test statistics, appropriately modified, have limiting $\chi^2$ distributions. This brings statistical inference back to the realm of traditional econometric analysis. Consider testing the null hypothesis

$$H_0: R \operatorname{vec} b = q,$$

where $q$ is a vector of dimension $g$ and $R$ selects the appropriate elements of $\operatorname{vec} b$. Thus, in effect we are considering hypotheses of the form $H_0: b = b_0$. The hypothesis $b = 0$ is thereby of particular interest. The Wald test statistic is then defined as

$$W = (R \operatorname{vec} \hat b - q)' \left[ R \left( \hat J_{11.2} \otimes \left( \sum_{t=1}^{T} (X_{2t}', D_t')'(X_{2t}', D_t') \right)^{-1} \right) R' \right]^{-1} (R \operatorname{vec} \hat b - q)$$

where $\hat J_{11.2} = \hat J_{11} - \hat J_{12}\hat J_{22}^{-1}\hat J_{21}$. It can be shown that the so defined modified Wald test statistic is asymptotically distributed as $\chi^2$ with $g$ degrees of freedom (see Phillips and Hansen 1990; Hansen 1992).


16.5  An  Example 

This example reproduces the study by Neusser (1991) with updated data for the United States over the period from the first quarter of 1950 to the fourth quarter of 2005. The starting point is a VAR model which consists of four variables: real gross domestic product (Y), real private consumption (C), real gross investment (I), and the ex-post real interest rate (R). All variables, except the real interest rate, are in logs. First, we identify a VAR model for these variables where the order is determined by Akaike’s (AIC), Schwarz’ (BIC) or Hannan-Quinn’s (HQ) information criteria. The AIC suggests seven lags whereas the other criteria propose a VAR of order two. As the VAR(7) contains many statistically insignificant coefficients, we prefer the more parsimonious VAR(2) model which produces the following estimates:


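$$
\begin{aligned}
X_t = \begin{pmatrix} Y_t \\ C_t \\ I_t \\ R_t \end{pmatrix}
&= \begin{pmatrix} \underset{(0.047)}{0.185} \\ \underset{(0.043)}{0.069} \\ \underset{(0.117)}{0.041} \\ \underset{(0.097)}{-0.329} \end{pmatrix}
+ \begin{pmatrix}
\underset{(0.086)}{0.951} & \underset{(0.091)}{0.254} & \underset{(0.033)}{0.088} & \underset{(0.032)}{0.042} \\
\underset{(0.079)}{0.157} & \underset{(0.084)}{0.746} & \underset{(0.031)}{0.065} & \underset{(0.030)}{-0.013} \\
\underset{(0.216)}{0.283} & \underset{(0.229)}{0.250} & \underset{(0.084)}{1.304} & \underset{(0.081)}{0.026} \\
\underset{(0.178)}{0.324} & \underset{(0.189)}{-0.536} & \underset{(0.069)}{-0.024} & \underset{(0.067)}{0.551}
\end{pmatrix} X_{t-1} \\
&\quad + \begin{pmatrix}
\underset{(0.085)}{-0.132} & \underset{(0.093)}{-0.085} & \underset{(0.033)}{-0.089} & \underset{(0.031)}{\text{[illegible]}} \\
\underset{(0.078)}{-0.213} & \underset{(0.085)}{0.305} & \underset{(0.031)}{-0.066} & \underset{(0.029)}{0.112} \\
\underset{(0.214)}{-0.517} & \underset{(0.233)}{0.040} & \underset{(0.084)}{-0.364} & \underset{(0.079)}{0.098} \\
\underset{(0.176)}{-0.042} & \underset{(0.192)}{0.296} & \underset{(0.069)}{0.005} & \underset{(0.065)}{0.163}
\end{pmatrix} X_{t-2} + \hat Z_t
\end{aligned}
$$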

where the estimated standard errors of the corresponding coefficients are reported in parentheses. The estimate of the covariance matrix $\Sigma$, $\hat\Sigma$, is

$$
\hat\Sigma = 10^{-4} \begin{pmatrix}
0.722 & 0.428 & 1.140 & 0.002 \\
0.428 & 0.610 & 1.026 & -0.092 \\
1.140 & 1.026 & 4.473 & -0.328 \\
0.002 & -0.092 & -0.328 & 3.098
\end{pmatrix}.
$$

The sequence of the hypotheses starts with $H(0)$ which states that there exists no cointegrating relation. The alternative hypothesis is always $H(n)$ which places no restriction on the number of cointegrating relations. According to Table 16.2 the value of the trace test statistic is 111.772 which is clearly larger than the 5 % critical value of 47.856. Thus, the null hypothesis $H(0)$ is rejected and we consider next the hypothesis $H(1)$. This hypothesis is again clearly rejected so that we move on to the hypothesis $H(2)$, which is rejected as well because the test statistic of 20.503 exceeds the critical value of 15.495. Because $H(3)$ is not rejected, we conclude that there exist three cointegrating relations. To check this result, we test the hypothesis $H(2)$ against $H(3)$ using the max test. As this test also rejects $H(2)$, we can be pretty confident that there are three cointegrating relations given as:

$$
\hat\beta = \begin{pmatrix}
1.000 & 0.000 & 0.000 \\
0.000 & 1.000 & 0.000 \\
0.000 & 0.000 & 1.000 \\
-258.948 & -277.869 & -337.481
\end{pmatrix}.
$$


Table 16.2 Evaluation of the results of Johansen’s cointegration test

                                Trace statistic                   Max statistic
Null hypothesis    Eigenvalue   Test statistic   Critical value   Test statistic   Critical value
H(0): r = 0        0.190        111.772          47.856           47.194           27.584
H(1): r ≤ 1        0.179        64.578           29.797           44.075           21.132
H(2): r ≤ 2        0.081        20.503           15.495           18.983           14.265
H(3): r ≤ 3        0.007        1.520            3.841            1.520            3.841

Critical 5 % values are taken from MacKinnon et al. (1999)

This matrix is actually the outcome from the EVIEWS econometrics software package. It should be noted that EVIEWS, like other packages, chooses the normalization mechanically. This can become a problem if the variable on which the cointegration vectors are normalized is not part of the cointegrating relation.

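In this form, the cointegrating vectors are economically difficult to interpret. We therefore ask whether they are compatible with the following hypotheses:

$$
\beta_C = \begin{pmatrix} 1.0 \\ -1.0 \\ 0.0 \\ 0.0 \end{pmatrix}, \qquad
\beta_I = \begin{pmatrix} 1.0 \\ 0.0 \\ -1.0 \\ 0.0 \end{pmatrix}, \qquad
\beta_R = \begin{pmatrix} 0.0 \\ 0.0 \\ 0.0 \\ 1.0 \end{pmatrix}.
$$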


These hypotheses state that the log-difference (ratio) between consumption and GDP, the log-difference (ratio) between investment and GDP, and the real interest rate are stationary. They can be rationalized in the context of the neoclassical growth model (see King et al. 1991; Neusser 1991). Each of them can be brought into the form of Eq. (16.16) where $\beta$ is replaced by its estimate $\hat\beta$. The corresponding test statistic for each of the three cointegrating relations is distributed as a $\chi^2$ distribution with one degree of freedom,13 which gives a critical value of 3.84 at the 5 % significance level. The corresponding values for the test statistic are 12.69, 15.05 and 0.45, respectively. This implies that we must reject the first two hypotheses $\beta_C$ and $\beta_I$. However, the conjecture that the real interest rate is stationary cannot be rejected. Finally, we can investigate the joint hypothesis $\beta_0 = (\beta_C, \beta_I, \beta_R)$ which can be represented in the form (16.15). In this case the value of the test statistic is 41.20 which is clearly above the critical value of 7.81 inferred from the $\chi^2_3$ distribution.14 Thus, we must reject this joint hypothesis.

As a matter of comparison, we perform a similar investigation using the fully-modified approach of Phillips and Hansen (1990). For this purpose we restrict the analysis to $Y_t$, $C_t$, and $I_t$ because the real interest rate cannot be classified unambiguously as being stationary or integrated of order one. The long-run variance $J$ and its one-sided counterpart $\Lambda$ are estimated using the quadratic spectral kernel with VAR(1) prewhitening as advocated by Andrews and Monahan (1992) (see Sect. 4.4). Assuming two cointegrating relations and taking $Y_t$ and $C_t$ as the left hand side variables in the cointegrating regression (Eq. (16.8a)), the following results are obtained:


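$$
\begin{pmatrix} Y_t \\ C_t \end{pmatrix}
= \begin{pmatrix} \underset{(0.166)}{0.234} \\ \underset{(0.171)}{0.215} \end{pmatrix} I_t
+ \begin{pmatrix} \underset{(0.867)}{6.282} \\ \underset{(0.892)}{5.899} \end{pmatrix}
+ \begin{pmatrix} \underset{(0.002)}{0.006} \\ \underset{(0.002)}{0.007} \end{pmatrix} t
+ \hat u_{1t}
$$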


13 The degrees of freedom are computed according to the formula: $s(n-r) = 1(4-3) = 1$.
14 The degrees of freedom are computed according to the formula: $r(n-s) = 3(4-3) = 3$.

where the estimated standard deviations are reported in parentheses. The specification allows for a constant and a deterministic trend as well as a drift in the equation for $\Delta I_t$ (Eq. (16.8b), not shown).

Given these results we can test a number of hypotheses to get a better understanding of the cointegrating relations. First we test the hypothesis of no cointegration of $Y_t$, respectively $C_t$, with $I_t$. Thus, we test $H_0: b(1) = b(2) = 0$. The value of the corresponding Wald test statistic is equal to 2.386 which is considerably less than the 5 % critical value of 5.991. Therefore we cannot reject the null hypothesis of no cointegration. Another interesting hypothesis is $H_0: b(1) = b(2)$ which would mean that $Y_t$ and $C_t$ are cointegrated with cointegration vector $(1, -1)$. As the corresponding Wald statistic is equal to 0.315, this hypothesis cannot be rejected at the 5 % critical value of 3.841. This suggests a long-run relation between $Y_t$ and $C_t$.

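Repeating the analysis with $C_t$ and $I_t$ as the left-hand side variables leads to the following results:

$$
\begin{pmatrix} C_t \\ I_t \end{pmatrix}
= \begin{pmatrix} \underset{(0.075)}{0.834} \\ \underset{(0.680)}{2.192} \end{pmatrix} Y_t
+ \begin{pmatrix} \underset{(0.561)}{0.767} \\ \underset{(5.102)}{-11.27} \end{pmatrix}
+ \begin{pmatrix} \underset{(0.001)}{0.002} \\ \underset{(0.006)}{-0.008} \end{pmatrix} t
+ u_{1t}
$$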

As before, the hypothesis $H_0: b(1) = b(2) = 0$ is tested first. The corresponding value of the test statistic is 137.984, which is clearly above the 5 % critical value. Thus, the null hypothesis of no cointegration is rejected. Next, the hypothesis $H_0: b(1) = b(2) = 1$ is tested. This hypothesis is rejected as the value 7.717 of the test statistic is above the critical value. If these hypotheses are not tested jointly, but individually, the null hypothesis $b(1) = 1$ can be rejected, but $b(2) = 1$ cannot. These findings conform reasonably well with those based on the Johansen approach.
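The critical values quoted for these Wald tests can be checked with a few lines of Python. The sketch below is only an illustration and uses the standard library alone. It relies on two closed forms that are assumptions of the sketch and not part of the text: the $\chi^2(2)$ distribution function equals $1 - e^{-x/2}$, and a $\chi^2(1)$ variable is the square of a standard normal variable. The helper name `chi2_critical` is hypothetical.

```python
import math
from statistics import NormalDist

def chi2_critical(df, alpha=0.05):
    """Upper-alpha critical value of a chi-square distribution (df = 1 or 2 only)."""
    if df == 1:
        # a chi2(1) variable is the square of a standard normal variable
        return NormalDist().inv_cdf(1 - alpha / 2) ** 2
    if df == 2:
        # chi2(2) has distribution function 1 - exp(-x / 2)
        return -2.0 * math.log(alpha)
    raise ValueError("this sketch only covers df = 1 and df = 2")

# Wald statistics reported in the text, paired with the number of restrictions
tests = {
    "Y, C on I: b(1) = b(2) = 0": (2.386, 2),
    "Y, C on I: b(1) = b(2)": (0.315, 1),
    "C, I on Y: b(1) = b(2) = 0": (137.984, 2),
    "C, I on Y: b(1) = b(2) = 1": (7.717, 2),
}
decisions = {}
for name, (stat, df) in tests.items():
    critical = chi2_critical(df)
    decisions[name] = stat > critical      # True means the null is rejected at 5 %
    print(f"{name}: W = {stat}, critical value = {critical:.3f}, reject = {decisions[name]}")
```

The first two hypotheses are not rejected and the last two are, in line with the conclusions drawn above.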

The diverging results of the two specifications demonstrate the sensitivity of cointegration analysis with respect to the normalization. It is important that the variable on which the cointegrating vector is normalized is indeed in the cointegrating space. Otherwise, nonsensical results may be obtained.


State-Space  Models  and  the  Kalman  Filter 


The state space representation is a flexible technique originally developed in automatic control engineering to represent, model, and control dynamic systems. Thereby we summarize the unobserved or partially observed state of the system in period $t$ by an $m$-dimensional vector $X_t$. The evolution of the state is then described by a VAR of order one, usually called the state equation. A second equation describes the connection between the state and the observations given by an $n$-dimensional vector $Y_t$. Despite their simple structure, state space models encompass a large variety of model classes: VARMA, respectively VARIMA models,1 unobserved-component models, factor models, structural time series models which decompose a given time series into a trend, a seasonal, and a cyclical component, models with measurement errors, VAR models with time-varying parameters, etc.

From a technical point of view, the main advantage of state space modeling is the unified treatment of estimation, forecasting, and smoothing. At the center of the analysis stands the Kalman filter, named after its inventor Rudolf Emil Kalman (Kalman 1960, 1963). He developed a projection based algorithm which recursively produces a statistically optimal estimate of the state. The versatility and the ease of implementation have made the Kalman filter an increasingly popular tool also in the economically oriented time series literature. Here we present just an introduction to the subject and refer to Anderson and Moore (1979), Brockwell and Davis (1991; Chapter 12), Brockwell and Davis (1996; Chapter 8), Hamilton (1994b; Chapter 13), Hamilton (1994a), Hannan and Deistler (1988), or Harvey (1989), and in particular to Durbin and Koopman (2011) and Kim and Nelson (1999) for extensive reviews and further details.


1 VARIMA models stand for vector autoregressive integrated moving-average models.

17.1 The State Space Model

We  consider  a  dynamical  system  whose  state  at  each  point  in  time  t  is  determined 
by  a  vector  Xt.  The  evolution  of  the  system  over  time  is  then  described  by  a 
state  equation.  The  state  is,  however,  unobserved  or  only  partly  observed  to  the 
outside  observer.  Thus,  a  second  equation,  called  the  observation  equation,  is 
needed  to  describe  the  connection  of  the  state  to  the  observations.  This  relation 
may be subject to measurement errors. The equation system consisting of state and observation equation is called a state space model, which is visualized in Fig. 17.1. The state equation typically consists of a VAR model of order one whereas the observation equation has the structure of a multivariate linear regression model.2
Despite  the  simplicity  of  each  of  these  two  components,  their  combination  is  very 
versatile  and  able  to  represent  a  great  variety  of  models. 

In the case of time invariant coefficients3 we can set up these two equations as follows:

$$\text{state equation:} \quad X_{t+1} = F X_t + V_{t+1}, \quad t = 1, 2, \ldots \tag{17.1}$$

$$\text{observation equation:} \quad Y_t = A + G X_t + W_t, \quad t = 1, 2, \ldots \tag{17.2}$$

Thereby $X_t$ denotes an $m$-dimensional vector which describes the state of the system in period $t$. The evolution of the state is represented as a vector autoregressive model of order one with coefficient matrix $F$ and disturbances $V_{t+1}$.4 As we assume that the state $X_t$ is unobservable or at least partly unobservable, we need a second equation which relates the state to the observations. In particular, we assume that there is a linear time-invariant relation given by $A$ and $G$ of the $n$-dimensional vector of observations, $Y_t$, to the state $X_t$. This relation may be contaminated by measurement errors $W_t$. The system is initialized in period $t = 1$.

Fig. 17.1 State space model (diagram: the present state $X_t$ and the error $V_{t+1}$ enter the internal dynamics given by the state equation $X_{t+1} = F X_t + V_{t+1}$, which produce the future state $X_{t+1}$; the observations $Y_t = A + G X_t + W_t$ are data possibly contaminated by measurement errors $W_t$)

2 We will focus on linear dynamic models only. With the availability of fast and cheap computing facilities, non-linear approaches have gained some popularity. See Durbin and Koopman (2011) for an exposition.

3 For the ease of exposition, we will present first the time-invariant case and analyze the case of time-varying coefficients later.

We make the following simplifying assumptions about the state space model represented by Eqs. (17.1) and (17.2).

(i) $\{V_t\} \sim \mathrm{WN}(0, Q)$ where $Q$ is a constant nonnegative definite $m \times m$ matrix.

(ii) $\{W_t\} \sim \mathrm{WN}(0, R)$ where $R$ is a constant nonnegative definite $n \times n$ matrix.

(iii) The two disturbances are uncorrelated with each other at all leads and lags, i.e.:

$$E(W_s V_t') = 0, \quad \text{for all } t \text{ and } s.$$

(iv) $V_t$ and $W_t$ are multivariate normally distributed.

(v) $X_1$ is uncorrelated with $V_t$ as well as with $W_t$, $t = 1, 2, \ldots$
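Equations (17.1) and (17.2) together with assumptions (i) to (v) can be turned directly into a simulation. The following is a minimal sketch, assuming NumPy is available; the function name `simulate_state_space` and all numerical values are hypothetical choices made for illustration only.

```python
import numpy as np

def simulate_state_space(F, G, A, Q, R, x1, T, rng):
    """Simulate X_{t+1} = F X_t + V_{t+1} and Y_t = A + G X_t + W_t for t = 1, ..., T."""
    m, n = F.shape[0], G.shape[0]
    X = np.zeros((T, m))
    Y = np.zeros((T, n))
    X[0] = x1                                             # initialization in period t = 1
    for t in range(T):
        W = rng.multivariate_normal(np.zeros(n), R)       # measurement error W_t
        Y[t] = A + G @ X[t] + W
        if t + 1 < T:
            V = rng.multivariate_normal(np.zeros(m), Q)   # state disturbance, drawn independently of W
            X[t + 1] = F @ X[t] + V
    return X, Y

rng = np.random.default_rng(0)
F = np.array([[0.8, 0.1], [0.0, 0.5]])
G = np.array([[1.0, 1.0]])
A = np.array([2.0])
Q = np.diag([1.0, 0.0])        # singular, but nonnegative definite as allowed by (i)
R = np.array([[0.25]])
X, Y = simulate_state_space(F, G, A, Q, R, x1=np.zeros(2), T=200, rng=rng)
```

Because the second diagonal element of `Q` is zero and the second state starts at zero, the second component of the state remains zero throughout, which illustrates why the covariance matrices need not be positive definite.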

Remark 17.1. In a more general context, we can make both covariance matrices $Q$ and $R$ time-varying and allow for contemporaneous correlations between $V_t$ and $W_t$ (see example Sect. 17.4.1).

Remark 17.2. As both the state and the observation equation may include identities, the covariance matrices need not be positive definite. They can be non-negative definite.

Remark 17.3. Neither $\{X_t\}$ nor $\{Y_t\}$ are assumed to be stationary.

Remark 17.4. The specification of the state equation and the normality assumption imply that the sequence $\{X_1, V_1, V_2, \ldots\}$ is independent so that the conditional distribution of $X_{t+1}$ given $X_t, X_{t-1}, \ldots, X_1$ equals the conditional distribution of $X_{t+1}$ given $X_t$. Thus, the process $\{X_t\}$ satisfies the Markov property. As the dimension of the state vector $X_t$ is arbitrary, it can be expanded in such a way as to encompass every component $X_{t-i}$ for any $i$ (see, for example, the state space representation of a VAR(p) model with $p > 1$). However, there remains the problem of the smallest dimension of the state vector (see Sect. 17.3.2).

Remark 17.5. The state space representation is not unique. Defining, for example, a new state vector $\tilde{X}_t$ by multiplying $X_t$ with an invertible matrix $P$, i.e. $\tilde{X}_t = P X_t$, all properties of the system remain unchanged. Naturally, we must redefine all the system matrices accordingly: $\tilde{F} = P F P^{-1}$, $\tilde{Q} = P Q P'$, $\tilde{G} = G P^{-1}$.
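The non-uniqueness described in Remark 17.5 can be checked numerically. The sketch below, assuming NumPy and arbitrarily chosen (hypothetical) matrices, drives the original and the transformed system with the same shocks and compares the implied observables $G X_t$ and $\tilde{G}\tilde{X}_t$.

```python
import numpy as np

rng = np.random.default_rng(1)
F = np.array([[0.7, 0.2], [0.1, 0.4]])
G = np.array([[1.0, 0.5]])
Q = np.array([[1.0, 0.3], [0.3, 0.5]])
P = np.array([[2.0, 1.0], [0.0, 1.0]])     # any invertible matrix will do
P_inv = np.linalg.inv(P)

# transformed system matrices as in Remark 17.5
F_new = P @ F @ P_inv
Q_new = P @ Q @ P.T
G_new = G @ P_inv

T = 50
V = rng.multivariate_normal(np.zeros(2), Q, size=T)
x = np.zeros(2)
x_new = P @ x
max_gap = 0.0
for t in range(T):
    # the observable part G X_t must coincide in both representations
    max_gap = max(max_gap, abs((G @ x - G_new @ x_new).item()))
    x = F @ x + V[t]
    x_new = F_new @ x_new + P @ V[t]       # the transformed shock P V_t has covariance P Q P'
print(max_gap)
```

The gap is zero up to floating-point error, so both parametrizations generate the same observations.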


4 In control theory the state equation (17.1) is amended by an additional term $H U_t$ which represents the effect of control variables $U_t$. These exogenous controls are used to regulate the system.

Given $X_1$, we can iterate the state equation forward to arrive at:

$$X_t = F^{t-1} X_1 + \sum_{j=1}^{t-1} F^{j-1} V_{t+1-j}, \quad t = 1, 2, \ldots$$

$$Y_t = A + G F^{t-1} X_1 + \sum_{j=1}^{t-1} G F^{j-1} V_{t+1-j} + W_t, \quad t = 1, 2, \ldots$$

The state equation is called stable or causal if all eigenvalues of $F$ are inside the unit circle, which is equivalent to all roots of $\det(I_m - Fz) = 0$ being outside the unit circle (see Sect. 12.3). In this case the state equation has a unique stationary solution:

$$X_t = \sum_{j=0}^{\infty} F^{j} V_{t-j}. \tag{17.3}$$

The process $\{Y_t\}$ is therefore also stationary and we have:

$$Y_t = A + \sum_{j=0}^{\infty} G F^{j} V_{t-j} + W_t. \tag{17.4}$$

In the case of a stationary state space model, we may do without an initialization period and take $t \in \mathbb{Z}$.

In the case of a stable state equation, we can easily deduce the covariance function for $\{X_t\}$, $\Gamma_X(h)$, $h = 0, 1, 2, \ldots$ According to Sect. 12.4 it holds that:

$$\Gamma_X(0) = F \Gamma_X(0) F' + Q,$$

$$\Gamma_X(h) = F^h \Gamma_X(0), \quad h = 1, 2, \ldots$$

where $\Gamma_X(0)$ is uniquely determined given the stability assumption. Similarly, we can derive the covariance function for the observation vector, $\Gamma_Y(h)$, $h = 0, 1, 2, \ldots$:

$$\Gamma_Y(0) = G \Gamma_X(0) G' + R,$$

$$\Gamma_Y(h) = G F^h \Gamma_X(0) G', \quad h = 1, 2, \ldots$$
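The equation for $\Gamma_X(0)$ is linear in the unknown matrix and can be solved by vectorization, since $\mathrm{vec}(F \Gamma F') = (F \otimes F)\,\mathrm{vec}(\Gamma)$. The following sketch, assuming NumPy, shows one way to do this; the function names are hypothetical, and for larger systems a dedicated Lyapunov solver would be the more economical choice.

```python
import numpy as np

def stationary_state_covariance(F, Q):
    """Solve Gamma = F Gamma F' + Q via vec(Gamma) = (I - F kron F)^{-1} vec(Q)."""
    m = F.shape[0]
    if np.max(np.abs(np.linalg.eigvals(F))) >= 1.0:
        raise ValueError("state equation is not stable")
    vec_gamma = np.linalg.solve(np.eye(m * m) - np.kron(F, F), Q.reshape(-1))
    return vec_gamma.reshape(m, m)

def observation_autocov(F, G, Q, R, h):
    """Gamma_Y(h) for h = 0, 1, 2, ... according to the formulas above."""
    gamma_x0 = stationary_state_covariance(F, Q)
    if h == 0:
        return G @ gamma_x0 @ G.T + R
    return G @ np.linalg.matrix_power(F, h) @ gamma_x0 @ G.T

# scalar AR(1) state with coefficient 0.5, observed with measurement noise
F1, G1 = np.array([[0.5]]), np.array([[1.0]])
Q1, R1 = np.array([[1.0]]), np.array([[0.5]])
gamma_y0 = observation_autocov(F1, G1, Q1, R1, 0)   # 1 / (1 - 0.25) + 0.5
gamma_y1 = observation_autocov(F1, G1, Q1, R1, 1)   # 0.5 / (1 - 0.25)

# a two-dimensional check that the fixed-point equation holds
F2 = np.array([[0.7, 0.2], [0.1, 0.4]])
Q2 = np.array([[1.0, 0.3], [0.3, 0.5]])
gamma2 = stationary_state_covariance(F2, Q2)
```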




17.1.1  Examples 

The following examples illustrate the versatility of the state space model and demonstrate how many economically relevant models can be represented in this form.

VAR(p) Process

Suppose that $\{Y_t\}$ follows an $n$-dimensional VAR(p) process given by $\Phi(L) Y_t = Z_t$, respectively by $Y_t = \Phi_1 Y_{t-1} + \ldots + \Phi_p Y_{t-p} + Z_t$, with $Z_t \sim \mathrm{WN}(0, \Sigma)$. Then the companion form of the VAR(p) process (see Sect. 12.2) just represents the state equation (17.1):


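$$
X_{t+1} =
\begin{pmatrix} Y_{t+1} \\ Y_t \\ Y_{t-1} \\ \vdots \\ Y_{t-p+2} \end{pmatrix}
=
\begin{pmatrix}
\Phi_1 & \Phi_2 & \ldots & \Phi_{p-1} & \Phi_p \\
I_n & 0 & \ldots & 0 & 0 \\
0 & I_n & \ldots & 0 & 0 \\
\vdots & & \ddots & & \vdots \\
0 & 0 & \ldots & I_n & 0
\end{pmatrix}
\begin{pmatrix} Y_t \\ Y_{t-1} \\ Y_{t-2} \\ \vdots \\ Y_{t-p+1} \end{pmatrix}
+
\begin{pmatrix} Z_{t+1} \\ 0 \\ 0 \\ \vdots \\ 0 \end{pmatrix}
= F X_t + V_{t+1},
$$

with $V_{t+1} = (Z_{t+1}', 0, 0, \ldots, 0)'$ and $Q = \begin{pmatrix} \Sigma & 0 \\ 0 & 0 \end{pmatrix}$. The observation equation is just an identity because all components of $X_t$ are observable:

$$Y_t = (I_n, 0, 0, \ldots, 0) X_t = G X_t.$$

Thus, $G = (I_n, 0, 0, \ldots, 0)$ and $R = 0$. Assuming that $X_t$ is already mean adjusted, $A = 0$.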
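The companion form is easy to assemble in code. The sketch below, assuming NumPy, builds $F$, $G$, and $Q$ from a list of coefficient matrices; the function name `var_to_state_space` and the numerical coefficients are hypothetical.

```python
import numpy as np

def var_to_state_space(Phi, Sigma):
    """Phi: list of p coefficient matrices (each n x n). Returns F, G, Q of the companion form."""
    p = len(Phi)
    n = Phi[0].shape[0]
    F = np.zeros((n * p, n * p))
    F[:n, :] = np.hstack(Phi)                    # first block row: Phi_1, ..., Phi_p
    F[n:, :-n] = np.eye(n * (p - 1))             # identity blocks below the first block row
    G = np.hstack([np.eye(n), np.zeros((n, n * (p - 1)))])
    Q = np.zeros((n * p, n * p))
    Q[:n, :n] = Sigma                            # only the first block carries noise
    return F, G, Q

Phi1 = np.array([[0.5, 0.1], [0.0, 0.3]])
Phi2 = np.array([[0.2, 0.0], [0.1, 0.1]])
Sigma = np.array([[1.0, 0.2], [0.2, 0.5]])
F, G, Q = var_to_state_space([Phi1, Phi2], Sigma)
print(np.max(np.abs(np.linalg.eigvals(F))))      # modulus below one indicates a stable state equation
```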


ARMA(1,1)  Process 

The representation of ARMA processes as a state space model is more involved when moving-average terms are present. Let $\{Y_t\}$ be an ARMA(1,1) process defined by the stochastic difference equation $Y_t = \phi Y_{t-1} + Z_t + \theta Z_{t-1}$ with $Z_t \sim \mathrm{WN}(0, \sigma^2)$ and $\phi\theta \neq 0$.

Define $\{X_t\}$ as the AR(1) process defined by the stochastic difference equation $X_t - \phi X_{t-1} = Z_t$ and $\mathbf{X}_t = (X_t, X_{t-1})'$ as the state vector, then we can write the observation equation as:

$$Y_t = (1, \theta)\mathbf{X}_t = G \mathbf{X}_t$$

with $R = 0$. The state equation is then

$$
\mathbf{X}_{t+1} = \begin{pmatrix} X_{t+1} \\ X_t \end{pmatrix}
= \begin{pmatrix} \phi & 0 \\ 1 & 0 \end{pmatrix} \begin{pmatrix} X_t \\ X_{t-1} \end{pmatrix}
+ \begin{pmatrix} Z_{t+1} \\ 0 \end{pmatrix}
= F \mathbf{X}_t + V_{t+1},
$$

where $Q = \begin{pmatrix} \sigma^2 & 0 \\ 0 & 0 \end{pmatrix}$. It is easy to verify that the so defined process $\{Y_t\}$ satisfies the stochastic difference equation $Y_t = \phi Y_{t-1} + Z_t + \theta Z_{t-1}$. Indeed $Y_t - \phi Y_{t-1} = (1, \theta)\mathbf{X}_t - \phi (1, \theta)\mathbf{X}_{t-1} = X_t + \theta X_{t-1} - \phi X_{t-1} - \theta\phi X_{t-2} = (X_t - \phi X_{t-1}) + \theta (X_{t-1} - \phi X_{t-2}) = Z_t + \theta Z_{t-1}$.

If $|\phi| < 1$, the state equation defines a causal process $\{\mathbf{X}_t\}$ so that the unique stationary solution is given by Eq. (17.3). This implies a stationary solution for $\{Y_t\}$ too. It is then easy to verify that this solution equals the unique stationary solution of the ARMA stochastic difference equation.

The state space representation of an ARMA model is not unique. An alternative representation in the case of a causal system is given by:

$$X_{t+1} = \phi X_t + (\phi + \theta) Z_t = F X_t + V_{t+1}$$

$$Y_t = X_t + Z_t = X_t + W_t.$$

Note that in this representation the dimension of the state vector is reduced from two to one. Moreover, the two disturbances $V_{t+1} = (\phi + \theta) Z_t$ and $W_t = Z_t$ are perfectly correlated.
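That the two representations describe the same process can be seen by feeding both with the same innovations. The following sketch, assuming NumPy and zero pre-sample values (an assumption of this illustration, not of the text), generates $Y_t$ from each representation and checks the ARMA(1,1) difference equation; the parameter values are hypothetical.

```python
import numpy as np

phi, theta, T = 0.6, 0.4, 300
rng = np.random.default_rng(2)
Z = rng.standard_normal(T)

# Representation 1: two-dimensional state (X_t, X_{t-1})' with X_t an AR(1) process
F1 = np.array([[phi, 0.0], [1.0, 0.0]])
G1 = np.array([1.0, theta])
state = np.zeros(2)                       # zero pre-sample values
Y1 = np.empty(T)
for t in range(T):
    state = F1 @ state + np.array([Z[t], 0.0])
    Y1[t] = G1 @ state                    # Y_t = X_t + theta X_{t-1}

# Representation 2: scalar state, X_{t+1} = phi X_t + (phi + theta) Z_t, Y_t = X_t + Z_t
s = 0.0
Y2 = np.empty(T)
for t in range(T):
    Y2[t] = s + Z[t]
    s = phi * s + (phi + theta) * Z[t]

# both series must satisfy Y_t - phi Y_{t-1} = Z_t + theta Z_{t-1}
lhs = Y1[1:] - phi * Y1[:-1]
rhs = Z[1:] + theta * Z[:-1]
print(np.max(np.abs(Y1 - Y2)), np.max(np.abs(lhs - rhs)))
```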

ARMA(p,q)  Process 

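It is straightforward to extend the above representation to ARMA(p,q) models.5 Let $\{Y_t\}$ be defined by the following stochastic difference equation:

$$\Phi(L) Y_t = \Theta(L) Z_t \quad \text{with } Z_t \sim \mathrm{WN}(0, \sigma^2) \text{ and } \phi_p \theta_q \neq 0.$$

Define $r$ as $r = \max\{p, q + 1\}$ and set $\phi_j = 0$ for $j > p$ and $\theta_j = 0$ for $j > q$. Then, we can set up the following state space representation with state vector $\mathbf{X}_t$ and observation equation

$$Y_t = (1, \theta_1, \ldots, \theta_{r-1}) \mathbf{X}_t$$

where the state vector equals $\mathbf{X}_t = (X_t, \ldots, X_{t-r+2}, X_{t-r+1})'$ and where $\{X_t\}$ follows an AR(p) process $\Phi(L) X_t = Z_t$. The AR(p) process can be transformed into companion form to arrive at the state equation:

$$
\mathbf{X}_{t+1} =
\begin{pmatrix}
\phi_1 & \phi_2 & \ldots & \phi_{r-1} & \phi_r \\
1 & 0 & \ldots & 0 & 0 \\
0 & 1 & \ldots & 0 & 0 \\
\vdots & & \ddots & & \vdots \\
0 & 0 & \ldots & 1 & 0
\end{pmatrix}
\mathbf{X}_t
+
\begin{pmatrix} Z_{t+1} \\ 0 \\ 0 \\ \vdots \\ 0 \end{pmatrix}.
$$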


Missing  Observations 

The state space approach is particularly well suited to deal with missing observations. However, in this situation the coefficient matrices are no longer constant, but time-varying. Consider the following simple example of an AR(1) process for which we have only observations for the periods $t = 1, \ldots, 100$ and $t = 102, \ldots, 200$, but not for period $t = 101$, which is missing. This situation can be represented in state space form as follows:

$$X_{t+1} = \phi X_t + Z_t$$

$$Y_t = G_t X_t + W_t$$

$$G_t = \begin{cases} 1, & t = 1, \ldots, 100, 102, \ldots, 200; \\ 0, & t = 101. \end{cases}$$

$$R_t = \begin{cases} 0, & t = 1, \ldots, 100, 102, \ldots, 200; \\ c > 0, & t = 101. \end{cases}$$

This means that $W_t = 0$ and that $Y_t = X_t$ for all $t$ except for $t = 101$. For the missing observation, we have $G_{101} = Y_{101} = 0$. The variance for this observation is set to $R_{101} = c > 0$.

5 See also Exercise 17.2.

The  same  idea  can  be  used  to  obtain  quarterly  data  when  only  yearly  data  are 
available.  This  problem  typically  arises  in  statistical  offices  which  have  to  produce, 
for  example,  quarterly  GDP  data  from  yearly  observations  incorporating  quarterly 
information  from  indicator  variables  (see  Sect.  17.4.1).  More  detailed  analysis  for 
the  case  of  missing  data  can  be  found  in  Harvey  and  Pierce  (1984)  and  Brockwell 
and  Davis  (1991;  Chapter  12.3). 

Time-Varying  Coefficients 

Consider the regression model with time-varying parameter vector $\beta_t$:

$$Y_t = x_t' \beta_t + W_t \tag{17.5}$$

where $Y_t$ is an observed dependent variable, $x_t$ is a $K$-vector of exogenous regressors, and $W_t$ is a white noise error term. Depending on the specification of the evolution of $\beta_t$, several models have been proposed in the literature:

Hildreth-Houck: $\beta_t = \bar{\beta} + v_t$

Harvey-Phillips: $\beta_t - \bar{\beta} = F(\beta_{t-1} - \bar{\beta}) + v_t$

Cooley-Prescott: $\beta_t = \beta_t^p + v_{1t}$, $\quad \beta_t^p = \beta_{t-1}^p + v_{2t}$

where $v_t$, $v_{1t}$, and $v_{2t}$ are white noise error terms. In the first specification, originally proposed by Hildreth and Houck (1968), the parameter vector is in each period just a random draw from a distribution with mean $\bar{\beta}$ and variance given by the variance of $v_t$. Departures from the mean are seen as being only of a transitory nature. In the specification by Harvey and Phillips (1982), assuming that all eigenvalues of $F$ are strictly smaller than one in absolute value, the parameter vector is a mean reverting VAR of order one. In this case, the departures from the mean can have a longer duration depending on the eigenvalues of $F$. The last specification due to Cooley and Prescott (1973, 1976) views the parameter vector as being subject to transitory and permanent shifts. Whereas shifts in $v_{1t}$ have only a transitory effect on $\beta_t$, movements in $v_{2t}$ result in permanent effects.

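In the Cooley-Prescott specification, for example, the state is given by $X_t = (\beta_t', \beta_t^{p\prime})'$ and the state equation can be written as:

$$
X_{t+1} = \begin{pmatrix} \beta_{t+1} \\ \beta_{t+1}^p \end{pmatrix}
= \begin{pmatrix} 0 & I_K \\ 0 & I_K \end{pmatrix} \begin{pmatrix} \beta_t \\ \beta_t^p \end{pmatrix}
+ \begin{pmatrix} v_{1,t+1} + v_{2,t+1} \\ v_{2,t+1} \end{pmatrix}
= F X_t + V_{t+1}.
$$

The observation equation then becomes:

$$Y_t = (x_t', 0) X_t + W_t.$$

Thus, $A = 0$ and $G_t = (x_t', 0)$. Note that this is an example of a state space model with time-varying coefficients. In Sect. 18.2, we will discuss time-varying coefficient models in the context of VAR models.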

Structural  Time  Series  Analysis 

An important application of the state space representation in economics is the decomposition of a given time series into several components: trend, cycle, season and irregular component. This type of analysis is usually called structural time series analysis (see Harvey 1989; Mills 2003). Consider, for example, the additive decomposition of a time series $\{Y_t\}$ into a trend $T_t$, a seasonal component $S_t$, a cyclical component $C_t$, and an irregular component $W_t$:

$$Y_t = T_t + S_t + C_t + W_t.$$

The above equation relates the observed time series to its unobserved components and is called the basic structural model (BSM) (Harvey 1989).

The state space representation is derived in several steps. Consider first the case with no seasonal and no cyclical component. The trend is typically viewed as a random walk with time-varying drift $\delta_{t-1}$:

$$T_t = \delta_{t-1} + T_{t-1} + \varepsilon_t, \quad \varepsilon_t \sim \mathrm{WN}(0, \sigma_\varepsilon^2)$$

$$\delta_t = \delta_{t-1} + \xi_t, \quad \xi_t \sim \mathrm{WN}(0, \sigma_\xi^2).$$

The second equation models the drift as a random walk. The two disturbances $\{\varepsilon_t\}$ and $\{\xi_t\}$ are assumed to be uncorrelated with each other and with $\{W_t\}$. Defining the state vector $X_t^{(T)}$ as $X_t^{(T)} = (T_t, \delta_t)'$, the state and the observation equations become:

$$
X_{t+1}^{(T)} = \begin{pmatrix} T_{t+1} \\ \delta_{t+1} \end{pmatrix}
= \begin{pmatrix} 1 & 1 \\ 0 & 1 \end{pmatrix} \begin{pmatrix} T_t \\ \delta_t \end{pmatrix}
+ \begin{pmatrix} \varepsilon_{t+1} \\ \xi_{t+1} \end{pmatrix}
= F^{(T)} X_t^{(T)} + V_{t+1}^{(T)}
$$

$$Y_t = (1, 0) X_t^{(T)} + W_t$$

with $W_t \sim \mathrm{WN}(0, \sigma_W^2)$. This representation is called the local linear trend (LLT) model and implies that $\{Y_t\}$ follows an ARIMA(0,2,2) process (see Exercise 17.5.1).

In the special case of a constant drift equal to $\delta$, $\sigma_\xi^2 = 0$ and we have that $\Delta Y_t = \delta + \varepsilon_t + W_t - W_{t-1}$. $\{\Delta Y_t\}$ therefore follows an MA(1) process with $\rho(1) = -\sigma_W^2/(\sigma_\varepsilon^2 + 2\sigma_W^2) = -(2 + \kappa)^{-1}$ where $\kappa = \sigma_\varepsilon^2/\sigma_W^2$ is called the signal-to-noise ratio. Note that the first order autocorrelation is necessarily negative. Thus, this model is not suited for time series with positive first order autocorrelation in its first differences.
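The expression for $\rho(1)$ follows from the autocovariances of $\varepsilon_t + W_t - W_{t-1}$: the variance is $\sigma_\varepsilon^2 + 2\sigma_W^2$ and the first order autocovariance is $-\sigma_W^2$, because only $W_{t-1}$ is shared by two neighbouring differences. A small sketch (the function name is hypothetical) makes the dependence on the signal-to-noise ratio concrete:

```python
def first_diff_autocorr(sigma2_eps, sigma2_w):
    """First order autocorrelation of Delta Y_t = delta + eps_t + W_t - W_{t-1}."""
    gamma0 = sigma2_eps + 2.0 * sigma2_w    # variance of eps_t + W_t - W_{t-1}
    gamma1 = -sigma2_w                      # covariance between neighbouring differences
    return gamma1 / gamma0

rhos = {}
for kappa in (0.1, 1.0, 10.0):
    # with sigma2_w = 1 the signal-to-noise ratio equals sigma2_eps
    rhos[kappa] = first_diff_autocorr(sigma2_eps=kappa, sigma2_w=1.0)
    print(kappa, rhos[kappa], -1.0 / (2.0 + kappa))
```

The autocorrelation always lies between $-1/2$ (pure measurement noise, $\kappa = 0$) and zero (no measurement noise, $\kappa \to \infty$).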

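The seasonal component is characterized by two conditions $S_t = S_{t-d}$ and $\sum_{t=1}^{d} S_t = 0$ where $d$ denotes the frequency of the data.6 Given starting values $S_1, S_0, S_{-1}, \ldots, S_{-d+3}$, the subsequent values can be computed recursively as:

$$S_{t+1} = -S_t - \ldots - S_{t-d+2} + \eta_{t+1}, \quad t = 1, 2, \ldots$$

where a noise $\eta_t \sim \mathrm{WN}(0, \sigma_\eta^2)$ is taken into account. The state vector related to the seasonal component, $X_t^{(S)}$, is defined as $X_t^{(S)} = (S_t, S_{t-1}, \ldots, S_{t-d+2})'$ which gives the state equation

$$
X_{t+1}^{(S)} =
\begin{pmatrix}
-1 & -1 & \ldots & -1 & -1 \\
1 & 0 & \ldots & 0 & 0 \\
0 & 1 & \ldots & 0 & 0 \\
\vdots & & \ddots & & \vdots \\
0 & 0 & \ldots & 1 & 0
\end{pmatrix}
X_t^{(S)}
+
\begin{pmatrix} \eta_{t+1} \\ 0 \\ 0 \\ \vdots \\ 0 \end{pmatrix}
= F^{(S)} X_t^{(S)} + V_{t+1}^{(S)}
$$

with $Q^{(S)} = \mathrm{diag}(\sigma_\eta^2, 0, \ldots, 0)$.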
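With the noise switched off, the seasonal recursion reproduces a pattern that repeats every $d$ periods and sums to zero over any $d$ consecutive periods. The sketch below, assuming NumPy, builds the $(d-1) \times (d-1)$ transition matrix and checks both properties for quarterly data; the function name and the starting values are hypothetical.

```python
import numpy as np

def seasonal_transition(d):
    """Transition matrix of the (d-1)-dimensional seasonal state (S_t, ..., S_{t-d+2})'."""
    F = np.zeros((d - 1, d - 1))
    F[0, :] = -1.0                        # S_{t+1} = -S_t - ... - S_{t-d+2}
    F[1:, :-1] = np.eye(d - 2)            # remaining components are shifted down by one
    return F

d = 4                                     # quarterly data
F_S = seasonal_transition(d)
state = np.array([1.5, -0.5, -2.0])       # arbitrary starting values (S_1, S_0, S_{-1})
path = [state[0]]
for _ in range(11):                       # deterministic case: eta_t = 0 for all t
    state = F_S @ state
    path.append(state[0])
path = np.array(path)                     # S_1, S_2, ..., S_12
print(path)
```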

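Combining the trend and the seasonal model to an overall model with state vector $X_t$ given by $X_t = (X_t^{(T)\prime}, X_t^{(S)\prime})'$, we arrive at the state equation:

$$
X_{t+1} = \begin{pmatrix} F^{(T)} & 0 \\ 0 & F^{(S)} \end{pmatrix} X_t
+ \begin{pmatrix} V_{t+1}^{(T)} \\ V_{t+1}^{(S)} \end{pmatrix}
= F X_t + V_{t+1}
$$

where $F^{(T)} = \begin{pmatrix} 1 & 1 \\ 0 & 1 \end{pmatrix}$ and $F^{(S)}$ denote the transition matrices of the trend and the seasonal model, respectively, and with $Q = \mathrm{diag}(\sigma_\varepsilon^2, \sigma_\xi^2, \sigma_\eta^2, 0, \ldots, 0)$. The observation equation then is:

$$Y_t = (1 \;\; 0 \;\; 1 \;\; 0 \;\; \ldots \;\; 0) X_t + W_t$$

with $R = \sigma_W^2$.

Finally, we can add a cyclical component $\{C_t\}$ which is modeled as a harmonic process (see Sect. 6.2) with frequency $\lambda_C$, respectively periodicity $2\pi/\lambda_C$:

$$C_t = A \cos(\lambda_C t) + B \sin(\lambda_C t).$$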


6 Four in the case of quarterly and twelve in the case of monthly observations.
7 Alternative seasonal models can be found in Harvey (1989) and Hylleberg (1986).

Fig. 17.2 Spectral density of the cyclical component for different values of $\lambda_C$ and $\rho$

Following Harvey (1989; p. 39), we let the parameters $A$ and $B$ evolve over time by introducing the recursion

$$
\begin{pmatrix} C_{t+1} \\ C_{t+1}^* \end{pmatrix}
= \rho \begin{pmatrix} \cos\lambda_C & \sin\lambda_C \\ -\sin\lambda_C & \cos\lambda_C \end{pmatrix}
\begin{pmatrix} C_t \\ C_t^* \end{pmatrix}
+ \begin{pmatrix} V_{1,t+1}^{(C)} \\ V_{2,t+1}^{(C)} \end{pmatrix}
$$

where $C_0 = A$ and $C_0^* = B$ and where $\{C_t^*\}$ is an auxiliary process. The dampening factor $\rho$ allows for additional flexibility in the specification. The processes $\{V_{1,t}^{(C)}\}$ and $\{V_{2,t}^{(C)}\}$ are two mutually uncorrelated white noise processes. It is instructive to examine the spectral density (see Sect. 6.1) of the cyclical component in Fig. 17.2. It can be shown (see Exercise 17.5.2) that $\{C_t\}$ follows an ARMA(2,1) process.
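The link between the recursion and the harmonic process can be verified numerically: with $\rho = 1$ and no disturbances, iterating the rotation matrix reproduces $A\cos(\lambda_C t) + B\sin(\lambda_C t)$, and for $\rho < 1$ both eigenvalues of the transition matrix have modulus $\rho$. The sketch below assumes NumPy; the function name and the parameter values are hypothetical.

```python
import numpy as np

def cycle_transition(lam, rho=1.0):
    """rho times the rotation matrix used in the cyclical recursion."""
    return rho * np.array([[np.cos(lam), np.sin(lam)],
                           [-np.sin(lam), np.cos(lam)]])

lam, A, B = 2 * np.pi / 20, 1.0, 0.5      # a cycle with a period of 20 observations
state = np.array([A, B])                   # (C_0, C*_0) = (A, B)
C = [state[0]]
for _ in range(40):                        # no disturbances and rho = 1
    state = cycle_transition(lam) @ state
    C.append(state[0])
C = np.array(C)

t = np.arange(41)
harmonic = A * np.cos(lam * t) + B * np.sin(lam * t)

# with dampening, the eigenvalues are rho * exp(+/- i lam)
moduli = np.abs(np.linalg.eigvals(cycle_transition(lam, rho=0.9)))
print(np.max(np.abs(C - harmonic)), moduli)
```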

The cyclical component can be incorporated into the state space model above by augmenting the state vector $X_{t+1}$ by the cyclical components $C_{t+1}$ and $C_{t+1}^*$ and the error term $V_{t+1}$ by $\{V_{1,t+1}^{(C)}\}$ and $\{V_{2,t+1}^{(C)}\}$. The observation equation has to be amended accordingly. Section 17.4.2 presents an empirical application of this approach.

Dynamic  Factor  Models 

Dynamic factor models are an interesting approach when it comes to modeling simultaneously a large cross-section of time series. The concept was introduced into macroeconomics by Sargent and Sims (1977) and was then developed further and popularized by Quah and Sargent (1993), Reichlin (2003) and Breitung and Eickmeier (2006), among others. The idea is to view each time series $Y_{it}$, $i = 1, \ldots, n$, as the sum of a linear combination of some joint unobserved factors $f_t = (f_{1t}, \ldots, f_{rt})'$ and an idiosyncratic component $\{W_{it}\}$, $i = 1, \ldots, n$. Dynamic factor models are particularly effective when the number of factors $r$ is small compared to the number of time series $n$. In practice, several hundred time series are related to a handful of factors. In matrix notation we can write the observation equation for the dynamic factor model as follows:

$$Y_t = \Lambda_0 f_t + \Lambda_1 f_{t-1} + \ldots + \Lambda_q f_{t-q} + W_t$$

where $\Lambda_i$, $i = 0, 1, \ldots, q$, are $n \times r$ matrices. The state vector $X_t$ equals $(f_t', \ldots, f_{t-q}')'$ if we assume that the idiosyncratic component is white noise, i.e. $W_t = (W_{1t}, \ldots, W_{nt})' \sim \mathrm{WN}(0, R)$. The observation equation can then be written compactly as:

$$Y_t = G X_t + W_t$$

where $G = (\Lambda_0, \Lambda_1, \ldots, \Lambda_q)$. Usually, we assume that $R$ is a diagonal matrix. The correlation between the different time series is captured exclusively by the joint factors.

The state equation depends on the assumed dynamics of the factors. One possibility is to model $\{f_t\}$ as a VAR(p) process with $\Phi(L) f_t = e_t$, $e_t \sim \mathrm{WN}(0, \Sigma)$, and $p \leq q + 1$, so we can use the state space representation of the VAR(p) process from above. For the case $p = 2$ and $q = 2$ we get:

$$
X_{t+1} = \begin{pmatrix} f_{t+1} \\ f_t \\ f_{t-1} \end{pmatrix}
= \begin{pmatrix} \Phi_1 & \Phi_2 & 0 \\ I_r & 0 & 0 \\ 0 & I_r & 0 \end{pmatrix}
\begin{pmatrix} f_t \\ f_{t-1} \\ f_{t-2} \end{pmatrix}
+ \begin{pmatrix} e_{t+1} \\ 0 \\ 0 \end{pmatrix}
= F X_t + V_{t+1}
$$

and $Q = \mathrm{diag}(\Sigma, 0, 0)$. This scheme can be easily generalized to the case $p > q + 1$ or to allow for autocorrelated idiosyncratic components, assuming for example that they follow autoregressive processes.
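The system matrices of the dynamic factor model can be assembled mechanically from the loadings and the VAR coefficients of the factors. The following sketch, assuming NumPy, covers the case $p \leq q + 1$ only; the function name `dfm_state_space` and the randomly generated loadings are hypothetical.

```python
import numpy as np

def dfm_state_space(Lambdas, Phis, Sigma):
    """Lambdas: q+1 loading matrices (n x r); Phis: p VAR matrices (r x r) with p <= q+1."""
    q1 = len(Lambdas)                          # the state stacks q + 1 factor vectors
    r = Phis[0].shape[0]
    if len(Phis) > q1:
        raise ValueError("this sketch assumes p <= q + 1")
    G = np.hstack(Lambdas)                     # G = (Lambda_0, ..., Lambda_q)
    F = np.zeros((r * q1, r * q1))
    F[:r, :r * len(Phis)] = np.hstack(Phis)    # first block row: Phi_1, ..., Phi_p, 0, ..., 0
    F[r:, :-r] = np.eye(r * (q1 - 1))          # shift f_t, ..., f_{t-q+1} down by one block
    Q = np.zeros_like(F)
    Q[:r, :r] = Sigma
    return F, G, Q

rng = np.random.default_rng(3)
n, r = 5, 2
Lambdas = [rng.standard_normal((n, r)) for _ in range(3)]     # q = 2
Phis = [0.5 * np.eye(r), 0.2 * np.eye(r)]                     # p = 2
F, G, Q = dfm_state_space(Lambdas, Phis, np.eye(r))
```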

The dimension of these models can be considerably reduced by an appropriate re-parametrization or by collapsing the state space adequately (Bräuning and Koopman 2014). Such a reduction can considerably increase the efficiency of the estimation.

Real  Business  Cycle  Model  (RBC  Model) 

State space models are becoming increasingly popular in macroeconomics, especially in the context of dynamic stochastic general equilibrium (DSGE) models. These models can be seen as generalizations of the real business cycle (RBC) models.8 In these models a representative consumer is supposed to maximize the utility of his consumption stream over his infinite life time. Thereby, the consumer has the choice to consume part of his income or to invest his savings (part of his income which is not consumed) at the market rate of interest. These savings can be used as a means to finance investment projects which increase the economy wide capital stock. The increased capital stock then allows for increased production in the future. The production process itself is subject to random shocks called technology shocks.

8 Prototypical models can be found in King et al. (1988) or Woodford (2003). Canova (2007) and Dejong and Dave (2007) present a good introduction to the analysis of DSGE models.

The  solution  of  this  optimization  problem  is  a  nonlinear  dynamic  system 
which  determines  the  capital  stock  and  consumption  in  every  period.  Its  local 
behavior  can  be  investigated  by  linearizing  the  system  around  its  steady  state. 
This  equation  can  then  be  interpreted  as  the  state  equation  of  the  system.  The 
parameters  of  this  equation  F  and  Q  are  related,  typically  in  a  nonlinear  way,  to 
the  parameters  describing  the  utility  and  the  production  function  as  well  as  the 
process  of  technology  shocks.  Thus,  the  state  equation  summarizes  the  behavior 
of  the  theoretical  model. 

The  parameters  of  the  state  equation  can  then  be  estimated  by  relating  the  state 
vector,  given  by  the  capital  stock  and  the  state  of  the  technology,  via  the  observation 
equation  to  some  observable  variables,  like  real  GDP,  consumption,  investment,  or 
the interest rate. This then completes the state space representation of the model which can be analyzed and estimated using the tools presented in Sect. 17.3.9


17.2 Filtering and Smoothing

As we have seen, the state space model provides a very flexible framework for a wide array of applications. We therefore want to develop a set of tools to handle this kind of model in terms of interpretation and estimation. In this section we will analyze the problem of inferring the unobserved state from the data given the parameters of the model. In Sect. 17.3 we will then investigate the estimation of the parameters by maximum likelihood.

In many cases the state of the system is not or only partially observable. It is therefore of interest to infer from the data $Y_1, Y_2, \ldots, Y_T$ the state vector $X_t$. We can distinguish three types of problems depending on the information used:

(i) estimation of $X_t$ from $Y_1, \ldots, Y_{t-1}$, known as the prediction problem;

(ii) estimation of $X_t$ from $Y_1, \ldots, Y_t$, known as the filtering problem;

(iii) estimation of $X_t$ from $Y_1, \ldots, Y_T$, known as the smoothing problem.

For the ease of exposition, we will assume that the disturbances $V_t$ and $W_t$ are normally distributed. The recursive nature of the state equation implies that $X_t = F^{t-1} X_1 + \sum_{j=0}^{t-2} F^j V_{t-j}$. Therefore, $X_t$ is also normally distributed for all $t$, if $X_1$ is normally distributed. From the observation equation we can infer also that $Y_t$ is normally distributed, because it is the sum of two normally distributed random variables $A + G X_t$ and $W_t$. Thus, under these assumptions, the vector $(X_1', \ldots, X_T', Y_1', \ldots, Y_T')'$ is jointly normally distributed:

$$
\begin{pmatrix} X_1 \\ \vdots \\ X_T \\ Y_1 \\ \vdots \\ Y_T \end{pmatrix}
\sim N\left( \begin{pmatrix} \mu_X \\ \mu_Y \end{pmatrix},
\begin{pmatrix} \Gamma_X & \Gamma_{XY} \\ \Gamma_{YX} & \Gamma_Y \end{pmatrix} \right)
$$

where $\mu_X$ and $\mu_Y$ denote the stacked means and where the covariance matrices $\Gamma_X$, $\Gamma_{YX}$, $\Gamma_{XY}$ and $\Gamma_Y$ can be retrieved from the model given the parameters.

9 See Sargent (2004) or Fernandez-Villaverde et al. (2007) for systematic treatment of state space models in the context of macroeconomic models. In this literature the use of Bayesian methods is widespread (see An and Schorfheide 2007; Dejong and Dave 2007).

For  the  understanding  of  the  rest  of  this  section,  the  following  theorem  is  essential 
(see  standard  textbooks,  like  Amemiya  1994;  Greene  2008). 

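Theorem 17.1. Let $Z$ be an $n$-dimensional normally distributed random variable with $Z \sim N(\mu, \Sigma)$. Consider the partitioned vector $Z = (Z_1', Z_2')'$ where $Z_1$ and $Z_2$ are of dimensions $n_1 \geq 1$ and $n_2 \geq 1$, $n = n_1 + n_2$, respectively. The corresponding partitioning of the covariance matrix $\Sigma$ is

$$\Sigma = \begin{pmatrix} \Sigma_{11} & \Sigma_{12} \\ \Sigma_{21} & \Sigma_{22} \end{pmatrix}$$

where $\Sigma_{11} = \mathrm{V}Z_1$, $\Sigma_{22} = \mathrm{V}Z_2$, and $\Sigma_{12} = \Sigma_{21}' = \mathrm{cov}(Z_1, Z_2) = \mathrm{E}(Z_1 - \mathrm{E}Z_1)(Z_2 - \mathrm{E}Z_2)'$. Then the partitioned vectors $Z_1$ and $Z_2$ are normally distributed. Moreover, the conditional distribution of $Z_1$ given $Z_2$ is also normal with mean and variance

$$\mathrm{E}(Z_1 | Z_2) = \mathrm{E}Z_1 + \Sigma_{12} \Sigma_{22}^{-1} (Z_2 - \mathrm{E}Z_2),$$

$$\mathrm{V}(Z_1 | Z_2) = \Sigma_{11} - \Sigma_{12} \Sigma_{22}^{-1} \Sigma_{21}.$$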

This formula can be directly applied to figure out the mean and the variance of the state vector given the observations. Thus, setting $Z_1 = (X_1', \ldots, X_t')'$ and $Z_2 = (Y_1', \ldots, Y_{t-1}')'$, we get the predicted values; setting $Z_1 = (X_1', \ldots, X_t')'$ and $Z_2 = (Y_1', \ldots, Y_t')'$, we get the filtered values; setting $Z_1 = (X_1', \ldots, X_t')'$ and $Z_2 = (Y_1', \ldots, Y_T')'$, we get the smoothed values.
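The conditioning formulas of Theorem 17.1 translate into a few lines of linear algebra. The sketch below, assuming NumPy, computes the conditional mean and variance for a given partition; the function name `conditional_normal` and the bivariate example are hypothetical.

```python
import numpy as np

def conditional_normal(mu, Sigma, n1, z2):
    """Mean and variance of Z1 given Z2 = z2 for Z = (Z1', Z2')' ~ N(mu, Sigma)."""
    mu1, mu2 = mu[:n1], mu[n1:]
    S11, S21, S22 = Sigma[:n1, :n1], Sigma[n1:, :n1], Sigma[n1:, n1:]
    # solve(S22, S21) is S22^{-1} S21; its transpose equals S12 S22^{-1} by symmetry
    gain = np.linalg.solve(S22, S21).T
    cond_mean = mu1 + gain @ (z2 - mu2)
    cond_var = S11 - gain @ S21
    return cond_mean, cond_var

# bivariate example: unit variances and correlation 0.5
mu = np.zeros(2)
Sigma = np.array([[1.0, 0.5], [0.5, 1.0]])
cond_mean, cond_var = conditional_normal(mu, Sigma, n1=1, z2=np.array([2.0]))
print(cond_mean, cond_var)
```

Solving the linear system avoids forming the inverse of $\Sigma_{22}$ explicitly, although the dimension of the system still grows with the number of observations conditioned on.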

AR(1)  Process  with  Measurement  Errors 

We illustrate the above ideas by analyzing a univariate AR(1) process with measurement errors10:

$$X_{t+1} = \phi X_t + v_{t+1}, \quad v_t \sim \mathrm{IIDN}(0, \sigma_v^2)$$

$$Y_t = X_t + w_t, \quad w_t \sim \mathrm{IIDN}(0, \sigma_w^2).$$

10 Sargent (1989) provides an interesting application showing the implications of measurement errors in macroeconomic models.


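For simplicity, we assume $|\phi| < 1$. Suppose that we only have observations $Y_1$ and $Y_2$ at our disposal. The joint distribution of $(X_1, X_2, Y_1, Y_2)'$ is normal. The covariances can be computed by applying the methods discussed in Chap. 2:

$$
\begin{pmatrix} X_1 \\ X_2 \\ Y_1 \\ Y_2 \end{pmatrix}
\sim N\left( 0, \; \frac{\sigma_v^2}{1 - \phi^2}
\begin{pmatrix}
1 & \phi & 1 & \phi \\
\phi & 1 & \phi & 1 \\
1 & \phi & \tau & \phi \\
\phi & 1 & \phi & \tau
\end{pmatrix} \right),
\qquad \tau = 1 + \frac{(1 - \phi^2)\sigma_w^2}{\sigma_v^2}.
$$

The smoothed values are obtained by applying the formula from Theorem 17.1:

$$
\mathrm{E}\left( \begin{pmatrix} X_1 \\ X_2 \end{pmatrix} \Big| \; Y_1, Y_2 \right)
= \frac{1}{\tau^2 - \phi^2}
\begin{pmatrix} \tau - \phi^2 & \phi(\tau - 1) \\ \phi(\tau - 1) & \tau - \phi^2 \end{pmatrix}
\begin{pmatrix} Y_1 \\ Y_2 \end{pmatrix}.
$$

Note that for the last observation, $Y_2$ in our case, the filtered and the smoothed values are the same. For $X_1$ the filtered value is

$$
\mathrm{E}(X_1 | Y_1) = \frac{\sigma_v^2}{\sigma_v^2 + (1 - \phi^2)\sigma_w^2} \, Y_1
= \frac{1}{1 + (1 - \phi^2)\sigma_w^2/\sigma_v^2} \, Y_1.
$$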

An  intuition  for  this  result  can  be  obtained  by  considering  some  special  cases. 
For $\phi = 0$, the observations are not correlated over time. The filtered value for $X_1$
therefore corresponds to the smoothed one. This value lies between zero, the unconditional
mean of $X_1$, and $Y_1$ with the variance ratio $\sigma_w^2/\sigma_v^2$ delivering the weights:
the smaller the variance of the measurement error the closer the filtered value is to
$Y_1$. This conclusion holds also in general. If the variance of the measurement error is
relatively  large,  the  observations  do  not  deliver  much  information  so  that  the  filtered 
and  the  smoothed  values  are  close  to  the  unconditional  mean. 
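
The two-observation example can be checked numerically. The following Python sketch applies the conditioning formula of Theorem 17.1 to the AR(1) process with measurement errors. The parameter values, the two data points, and the helper name `conditional_normal` are illustrative assumptions and do not come from the text.

```python
import numpy as np

def conditional_normal(mu1, mu2, S11, S12, S22, z2):
    """Mean and variance of Z1 given Z2 = z2 for jointly normal (Z1, Z2)."""
    gain = S12 @ np.linalg.inv(S22)            # Sigma_12 Sigma_22^{-1}
    mean = mu1 + gain @ (z2 - mu2)
    var = S11 - gain @ S12.T                   # Sigma_21 = Sigma_12'
    return mean, var

# Assumed parameter values and hypothetical observations Y_1, Y_2
phi, sig2_v, sig2_w = 0.8, 1.0, 0.5
y = np.array([1.0, -0.5])

sig2_x = sig2_v / (1.0 - phi**2)               # unconditional variance of X_t
S11 = sig2_x * np.array([[1.0, phi], [phi, 1.0]])   # V(X_1, X_2)
S12 = S11.copy()                               # cov(X_i, Y_j) = cov(X_i, X_j)
S22 = S11 + sig2_w * np.eye(2)                 # V(Y_1, Y_2)

# Smoothed values: condition (X_1, X_2) on both observations
smoothed, smoothed_var = conditional_normal(
    np.zeros(2), np.zeros(2), S11, S12, S22, y)

# Filtered value of X_1: condition X_1 on Y_1 only
filt1, filt1_var = conditional_normal(
    np.zeros(1), np.zeros(1), S11[:1, :1], S12[:1, :1], S22[:1, :1], y[:1])

# Closed-form weight sigma_v^2 / (sigma_v^2 + (1 - phi^2) sigma_w^2) on Y_1
closed_form = sig2_v / (sig2_v + (1.0 - phi**2) * sig2_w) * y[0]
```

Here `filt1[0]` should agree with `closed_form`, and raising `sig2_w` pulls both the filtered and the smoothed values towards the unconditional mean of zero, as described above.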

For large systems the method suggested by Theorem 17.1 may run into numerical
problems due to the inversion of the covariance matrix of $Y$, $\Sigma_{22}$. This matrix can
become rather large as it is of dimension $nT \times nT$. Fortunately, there exist recursive
solutions to this problem known as the Kalman filter, and also the Kalman smoother.

17.2.1 The Kalman Filter

The Kalman filter circumvents the problem of inverting a large $nT \times nT$ matrix
by making use of the Markov property of the system (see Remark 17.4). The
distribution of $X_t$ given the observations up to period $t$ can thereby be computed
recursively from the distribution of the state in period $t-1$ given the information
available up to period $t-1$. Starting from some initial distribution in period 0, we
can in this way obtain in $T$ steps the distribution of all states. In each step only an
$n \times n$ matrix must be inverted. To describe the procedure in detail, we introduce the
following notation:


\[
E(X_t | Y_1, \ldots, Y_h) = X_{t|h}
\]

\[
V(X_t | Y_1, \ldots, Y_h) = P_{t|h}
\]

Suppose we have already determined the distribution of $X_t$ conditional on the
observations $Y_1, \ldots, Y_t$. Because we are operating in a framework of normally
distributed random variables, the distribution is completely characterized by its
conditional mean $X_{t|t}$ and variance $P_{t|t}$. The goal is to carry forward these entities
to obtain $X_{t+1|t+1}$ and $P_{t+1|t+1}$ having observed an additional data point $Y_{t+1}$. This
problem can be decomposed into a forecasting and an updating step.

Step 1: Forecasting Step The state equation and the assumption about the
disturbance term $V_{t+1}$ imply:

\[
\begin{aligned}
X_{t+1|t} &= F X_{t|t} \\
P_{t+1|t} &= F P_{t|t} F' + Q
\end{aligned}
\tag{17.6}
\]

The observation equation then allows us to compute a forecast of $Y_{t+1}$ where we
assume for simplicity that $A = 0$:

\[
Y_{t+1|t} = G X_{t+1|t} \tag{17.7}
\]


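Step 2: Updating Step In this step the additional information coming from the
additional observation $Y_{t+1}$ is processed to update the conditional distribution of the
state vector. The joint conditional distribution of $(X_{t+1}', Y_{t+1}')'$ given $Y_1, \ldots, Y_t$ is

\[
\begin{pmatrix} X_{t+1} \\ Y_{t+1} \end{pmatrix} \Big|\, Y_1, \ldots, Y_t
\sim N\left( \begin{pmatrix} X_{t+1|t} \\ Y_{t+1|t} \end{pmatrix},
\begin{pmatrix} P_{t+1|t} & P_{t+1|t} G' \\ G P_{t+1|t} & G P_{t+1|t} G' + R \end{pmatrix} \right)
\]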

As all elements of the distribution are available from the forecasting step, we can
again apply Theorem 17.1 to get the distribution of the filtered state vector at time
$t+1$:

\[
X_{t+1|t+1} = X_{t+1|t} + P_{t+1|t} G' (G P_{t+1|t} G' + R)^{-1} (Y_{t+1} - Y_{t+1|t}) \tag{17.8}
\]

\[
P_{t+1|t+1} = P_{t+1|t} - P_{t+1|t} G' (G P_{t+1|t} G' + R)^{-1} G P_{t+1|t} \tag{17.9}
\]

where we replace $X_{t+1|t}$, $P_{t+1|t}$, and $Y_{t+1|t}$ by $F X_{t|t}$, $F P_{t|t} F' + Q$, and $G F X_{t|t}$,
respectively, which have been obtained from the forecasting step.

Starting from given values for $X_{0|0}$ and $P_{0|0}$, we can therefore iteratively compute
$X_{t|t}$ and $P_{t|t}$ for all $t = 1, 2, \ldots, T$. Only the information from the last period is
necessary at each step. Inserting Eq. (17.8) into Eq. (17.6) we obtain as a forecasting
equation:

\[
X_{t+1|t} = F X_{t|t-1} + F P_{t|t-1} G' (G P_{t|t-1} G' + R)^{-1} (Y_t - G X_{t|t-1})
\]

where the matrix

\[
K_t = F P_{t|t-1} G' (G P_{t|t-1} G' + R)^{-1}
\]

is known as the (Kalman) gain matrix. It prescribes how the innovation $Y_t - Y_{t|t-1} =
Y_t - G X_{t|t-1}$ leads to an update of the predicted state.
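
The forecasting and updating steps translate almost line by line into code. The following Python sketch is a minimal implementation of Eqs. (17.6) to (17.9) under the assumption $A = 0$. The function name `kalman_filter`, the array layout, and the numerical values of the AR(1) example are assumptions made for illustration only.

```python
import numpy as np

def kalman_filter(Y, F, G, Q, R, x0, P0):
    """Kalman filter for X_{t+1} = F X_t + V_{t+1}, Y_t = G X_t + W_t (A = 0).

    Y has shape (T, n). Returns the predicted moments X_{t|t-1}, P_{t|t-1}
    and the filtered moments X_{t|t}, P_{t|t} for t = 1, ..., T.
    """
    T, m = Y.shape[0], F.shape[0]
    x_pred, P_pred = np.zeros((T, m)), np.zeros((T, m, m))
    x_filt, P_filt = np.zeros((T, m)), np.zeros((T, m, m))
    x, P = x0, P0
    for t in range(T):
        # Forecasting step, Eq. (17.6)
        x_pred[t] = F @ x
        P_pred[t] = F @ P @ F.T + Q
        # Updating step, Eqs. (17.8) and (17.9)
        innovation = Y[t] - G @ x_pred[t]             # Y_t - Y_{t|t-1}
        S = G @ P_pred[t] @ G.T + R                   # innovation variance
        gain = P_pred[t] @ G.T @ np.linalg.inv(S)     # K_t without the leading F
        x = x_pred[t] + gain @ innovation
        P = P_pred[t] - gain @ G @ P_pred[t]
        x_filt[t], P_filt[t] = x, P
    return x_pred, P_pred, x_filt, P_filt

# Assumed example: AR(1) state observed with measurement error
phi, sig2_v, sig2_w = 0.8, 1.0, 0.5
F = np.array([[phi]])
G = np.array([[1.0]])
Q = np.array([[sig2_v]])
R = np.array([[sig2_w]])
x0 = np.zeros(1)
P0 = np.array([[sig2_v / (1.0 - phi**2)]])    # stationary variance of X_0
Y = np.array([[1.0], [-0.5]])                 # hypothetical observations Y_1, Y_2

x_pred, P_pred, x_filt, P_filt = kalman_filter(Y, F, G, Q, R, x0, P0)
```

Only one matrix of the dimension of $Y_t$ is inverted per period, which is the point of the recursion. For clarity the sketch uses an explicit inverse; a linear solve would be the numerically safer choice in larger systems.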

Initializing the Algorithm It remains to determine how to initialize the recursion,
in particular how to set the starting values for $X_{0|0}$ and $P_{0|0}$. If $X_t$ is stationary and
causal with respect to $V_t$, the state equation has the solution $X_0 = \sum_{j=0}^{\infty} F^j V_{-j}$.
Thus,

\[
X_{0|0} = E(X_0) = 0
\]

\[
P_{0|0} = V(X_0)
\]

where $P_{0|0}$ solves the equation (see Sect. 12.4)

\[
P_{0|0} = F P_{0|0} F' + Q.
\]

According to Eq. (12.4), the solution of the above matrix equation is:

\[
\operatorname{vec}(P_{0|0}) = [I - F \otimes F]^{-1} \operatorname{vec}(Q).
\]

If the process is not stationary, we can set $X_{0|0}$ to zero and $P_{0|0}$ to infinity. In
practice, a very large number is sufficient.
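
The vec formula for the stationary starting value is straightforward to evaluate. The following Python sketch solves $P_{0|0} = F P_{0|0} F' + Q$ with a Kronecker product. The function names, the example matrices, and the scale `1e6` used for the nonstationary case are assumptions made for illustration.

```python
import numpy as np

def stationary_state_variance(F, Q):
    """Solve P = F P F' + Q via vec(P) = (I - F kron F)^{-1} vec(Q)."""
    m = F.shape[0]
    vec_Q = Q.reshape(-1, order="F")          # column-stacking vec operator
    vec_P = np.linalg.solve(np.eye(m * m) - np.kron(F, F), vec_Q)
    return vec_P.reshape((m, m), order="F")

def diffuse_state_variance(m, scale=1e6):
    """A 'very large' starting variance for a nonstationary state."""
    return scale * np.eye(m)

# Assumed transition matrix with eigenvalues 0.5 and 0.3 (inside the unit circle)
F = np.array([[0.5, 0.1],
              [0.0, 0.3]])
Q = np.eye(2)
P0 = stationary_state_variance(F, Q)
```

The column-major reshape (`order="F"`) matters: the identity $\operatorname{vec}(F P F') = (F \otimes F)\operatorname{vec}(P)$ holds for the column-stacking vec operator, whereas NumPy reshapes row by row by default.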


17.2.2  The  Kalman  Smoother 

The Kalman filter determines the distribution of the state at time $t$ given the
information available up to this time. In many instances, however, we want to make
an optimal forecast of the state given all the information available, i.e. the whole
sample. Thus, we want to determine $X_{t|T}$ and $P_{t|T}$. The Kalman filter determines
the smoothed distribution for $t = T$, i.e. $X_{T|T}$ and $P_{T|T}$. The idea of the Kalman
smoother is again to determine the smoothed distribution in a recursive manner. For
this purpose, we let the recursion run backwards. Starting with the last observation
in period $t = T$, we proceed back in time by letting $t$ take successively the values
$t = T-1, T-2, \ldots$ until the first observation in period $t = 1$.

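Using again the linearity of the equations and the normality assumption, we get:

\[
\begin{pmatrix} X_t \\ X_{t+1} \end{pmatrix} \Big|\, Y_1, \ldots, Y_t
\sim N\left( \begin{pmatrix} X_{t|t} \\ X_{t+1|t} \end{pmatrix},
\begin{pmatrix} P_{t|t} & P_{t|t} F' \\ F P_{t|t} & P_{t+1|t} \end{pmatrix} \right).
\]

This implies that

\[
E(X_t | Y_1, \ldots, Y_t, X_{t+1}) = X_{t|t} + P_{t|t} F' P_{t+1|t}^{-1} (X_{t+1} - X_{t+1|t}).
\]

The above mean is only conditional on all information available up to time $t$ and
on the information at time $t+1$. The Markov property implies that this mean also
incorporates the information from the observations $Y_{t+1}, \ldots, Y_T$. Thus, we have:

\[
\begin{aligned}
E(X_t | Y_1, \ldots, Y_T, X_{t+1}) &= E(X_t | Y_1, \ldots, Y_t, X_{t+1}) \\
&= X_{t|t} + P_{t|t} F' P_{t+1|t}^{-1} (X_{t+1} - X_{t+1|t}).
\end{aligned}
\]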

Applying the law of iterated expectations or means (see, e.g., Amemiya 1994; p. 78),
we can derive $X_{t|T}$:

\[
\begin{aligned}
X_{t|T} = E(X_t | Y_1, \ldots, Y_T) &= E\big(E(X_t | Y_1, \ldots, Y_T, X_{t+1}) \,\big|\, Y_1, \ldots, Y_T\big) \\
&= E\big(X_{t|t} + P_{t|t} F' P_{t+1|t}^{-1} (X_{t+1} - X_{t+1|t}) \,\big|\, Y_1, \ldots, Y_T\big) \\
&= X_{t|t} + P_{t|t} F' P_{t+1|t}^{-1} (X_{t+1|T} - X_{t+1|t}).
\end{aligned}
\tag{17.10}
\]

The algorithm can now be implemented as follows. In the first step compute
$X_{T-1|T}$ according to Eq. (17.10) as

\[
X_{T-1|T} = X_{T-1|T-1} + P_{T-1|T-1} F' P_{T|T-1}^{-1} (X_{T|T} - X_{T|T-1}).
\]

All entities on the right hand side can readily be computed by applying the Kalman
filter. Having found $X_{T-1|T}$, we can again use Eq. (17.10) for $t = T-2$ to evaluate
$X_{T-2|T}$:

\[
X_{T-2|T} = X_{T-2|T-2} + P_{T-2|T-2} F' P_{T-1|T-2}^{-1} (X_{T-1|T} - X_{T-1|T-2}).
\]

Proceeding backward through the sample we can derive a complete sequence of
smoothed states $X_{T|T}, X_{T-1|T}, X_{T-2|T}, \ldots, X_{1|T}$. These calculations are based on the
computations of $X_{t|t}$, $X_{t+1|t}$, $P_{t|t}$, and $P_{t+1|t}$ which have already been obtained from
the Kalman filter. The smoothed covariance matrix $P_{t|T}$ is given as (see Hamilton
1994b; Section 13.6):

\[
P_{t|T} = P_{t|t} + P_{t|t} F' P_{t+1|t}^{-1} (P_{t+1|T} - P_{t+1|t}) P_{t+1|t}^{-1} F P_{t|t}.
\]

Thus, we can compute also the smoothed variance with the aid of the values already
determined by the Kalman filter.
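
The backward recursion can be sketched in a few lines. The Python code below first runs a compact forward filter (with $A = 0$) and then applies Eq. (17.10) together with the formula for $P_{t|T}$. Function names, array layout, and the three hypothetical observations of the AR(1) example are assumptions made for illustration.

```python
import numpy as np

def kalman_filter(Y, F, G, Q, R, x0, P0):
    """Forward pass (A = 0): predicted and filtered means and variances."""
    T, m = Y.shape[0], F.shape[0]
    x_pred, P_pred = np.zeros((T, m)), np.zeros((T, m, m))
    x_filt, P_filt = np.zeros((T, m)), np.zeros((T, m, m))
    x, P = x0, P0
    for t in range(T):
        x_pred[t], P_pred[t] = F @ x, F @ P @ F.T + Q
        gain = P_pred[t] @ G.T @ np.linalg.inv(G @ P_pred[t] @ G.T + R)
        x = x_pred[t] + gain @ (Y[t] - G @ x_pred[t])
        P = P_pred[t] - gain @ G @ P_pred[t]
        x_filt[t], P_filt[t] = x, P
    return x_pred, P_pred, x_filt, P_filt

def kalman_smoother(x_pred, P_pred, x_filt, P_filt, F):
    """Backward pass: X_{t|T} from Eq. (17.10) and the matching P_{t|T}."""
    T = x_filt.shape[0]
    x_sm, P_sm = x_filt.copy(), P_filt.copy()     # at t = T smoothed = filtered
    for t in range(T - 2, -1, -1):
        # J = P_{t|t} F' P_{t+1|t}^{-1}; array index t + 1 holds the moments of X_{t+1}
        J = P_filt[t] @ F.T @ np.linalg.inv(P_pred[t + 1])
        x_sm[t] = x_filt[t] + J @ (x_sm[t + 1] - x_pred[t + 1])
        P_sm[t] = P_filt[t] + J @ (P_sm[t + 1] - P_pred[t + 1]) @ J.T
    return x_sm, P_sm

# Assumed example: AR(1) state with measurement error, three hypothetical observations
phi, sig2_v, sig2_w = 0.8, 1.0, 0.5
F = np.array([[phi]])
G = np.array([[1.0]])
Q = np.array([[sig2_v]])
R = np.array([[sig2_w]])
P0 = np.array([[sig2_v / (1.0 - phi**2)]])
Y = np.array([[1.0], [-0.5], [0.3]])

x_pred, P_pred, x_filt, P_filt = kalman_filter(Y, F, G, Q, R, np.zeros(1), P0)
x_sm, P_sm = kalman_smoother(x_pred, P_pred, x_filt, P_filt, F)
```

Because the model is linear and Gaussian, the smoothed means and variances produced this way coincide with those obtained by conditioning the joint normal distribution of all states on all observations, but without inverting the full covariance matrix of the observations.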


AR(1)  Process  with  Measurement  Errors  (Continued) 

We continue our illustrative example of an AR(1) process with measurement errors
and just two observations. First, we determine the filtered values for the state vector
with the aid of the Kalman filter. To initialize the process, we have to assign a
distribution to $X_0$. For simplicity, we assume that $|\phi| < 1$ so that it makes sense
to assign the stationary distribution of the process as the distribution for $X_0$:

\[
X_0 \sim N\left(0, \frac{\sigma_v^2}{1 - \phi^2}\right)
\]

Then we compute the forecasting step as the first step of the filter (see Eq. (17.6)):

\[
\begin{aligned}
X_{1|0} &= \phi X_{0|0} = 0 \\
P_{1|0} &= \phi^2 \frac{\sigma_v^2}{1 - \phi^2} + \sigma_v^2 = \frac{\sigma_v^2}{1 - \phi^2} \\
Y_{1|0} &= 0.
\end{aligned}
\]


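$P_{1|0}$ was computed by the recursive formula from the previous section, but is, of
course, equal to the unconditional variance. For the updating step, we get from
Eqs. (17.8) and (17.9):

\[
\begin{aligned}
X_{1|1} &= \frac{\sigma_v^2}{\sigma_v^2 + (1 - \phi^2)\sigma_w^2}\, Y_1 \\
P_{1|1} &= \frac{\sigma_v^2}{1 - \phi^2}\left(1 - \frac{\sigma_v^2}{\sigma_v^2 + (1 - \phi^2)\sigma_w^2}\right)
= \frac{\sigma_v^2 \sigma_w^2}{\sigma_v^2 + (1 - \phi^2)\sigma_w^2}.
\end{aligned}
\]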


These two results are then used to calculate the next iteration of the algorithm.
This will give the filtered values for $t = 2$ which would correspond to the smoothed
values because we just have two observations. The forecasting step is:

\[
\begin{aligned}
X_{2|1} &= \phi X_{1|1} \\
P_{2|1} &= \phi^2 P_{1|1} + \sigma_v^2 \\
Y_{2|1} &= X_{2|1}.
\end{aligned}
\]

Next we perform the updating step to calculate $X_{2|2}$ and $P_{2|2}$. It is easy to verify that
this leads to the same results as in the first part of this example.

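An interesting special case is obtained when we assume that $\phi = 1$ so that the
state variable is a simple random walk. In this case the unconditional variance of $X_t$
and consequently also of $Y_t$ are no longer finite. As mentioned previously, we can
initialize the Kalman filter by $X_{0|0} = 0$ and $P_{0|0} = \infty$. This implies:

\[
\begin{aligned}
Y_{1|0} &= X_{1|0} = X_{0|0} = 0 \\
P_{1|0} &= P_{0|0} + \sigma_v^2 = \infty.
\end{aligned}
\]

Inserting this result in the updating Eqs. (17.8) and (17.9), we arrive at:

\[
\begin{aligned}
X_{1|1} &= \frac{P_{1|0}}{P_{1|0} + \sigma_w^2}\,(Y_1 - Y_{1|0}) = \frac{P_{0|0} + \sigma_v^2}{P_{0|0} + \sigma_v^2 + \sigma_w^2}\, Y_1 \\
P_{1|1} &= P_{1|0} - \frac{P_{1|0}^2}{P_{1|0} + \sigma_w^2} = \frac{(P_{0|0} + \sigma_v^2)\,\sigma_w^2}{P_{0|0} + \sigma_v^2 + \sigma_w^2}.
\end{aligned}
\]

Letting $P_{0|0}$ go to infinity leads to:

\[
X_{1|1} = Y_1, \qquad P_{1|1} = \sigma_w^2.
\]

This shows that the filtered variance is finite for $t = 1$ although $P_{1|0}$ was infinite.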


17.3 Estimation of State Space Models

Up  to  now  we  have  assumed  that  the  parameters  of  the  system  are  known  and 
that  only  the  state  is  unknown.  In  most  economic  applications,  however,  also  the 
parameters  are  unknown  and  have  therefore  to  be  estimated  from  the  data.  One  big 
advantage  of  the  state  space  models  is  that  they  provide  an  integrated  approach 
to forecasting, smoothing and estimation. In particular, the Kalman filter turns
out to be an efficient and quick way to compute the likelihood function. Thus, it
seems natural to estimate the parameters of state space models by the method of
maximum likelihood. Kim and Nelson (1999) and Durbin and Koopman (2011)
provide excellent and extensive reviews of the estimation of state space models
using the Kalman filter.

More  recently,  due  to  advances  in  computational  methods,  in  particular  with 
respect  to  sparse  matrix  programming,  other  approaches  can  be  implemented.  For 
example,  by  giving  the  states  a  matrix  representation  Chan  and  Jeliazkov  (2009) 
derive  a  viable  and  efficient  method  for  the  estimation  of  state  space  models. 

17.3.1  The  Likelihood  Function 

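The joint unconditional density of the observations $(Y_1', \ldots, Y_T')'$ can be factorized
into the product of conditional densities as follows:

\[
\begin{aligned}
f(Y_1, \ldots, Y_T) &= f(Y_T | Y_1, \ldots, Y_{T-1})\, f(Y_1, \ldots, Y_{T-1}) \\
&= f(Y_T | Y_1, \ldots, Y_{T-1})\, f(Y_{T-1} | Y_1, \ldots, Y_{T-2}) \cdots f(Y_2 | Y_1)\, f(Y_1)
\end{aligned}
\]

Each conditional density is the density of a normal distribution and is therefore
given by:

\[
f(Y_t | Y_1, \ldots, Y_{t-1}) = (2\pi)^{-n/2} (\det \Delta_t)^{-1/2}
\exp\left[ -\tfrac{1}{2} (Y_t - Y_{t|t-1})' \Delta_t^{-1} (Y_t - Y_{t|t-1}) \right]
\]

where $\Delta_t = G P_{t|t-1} G' + R$. The Gaussian likelihood function $L$ is therefore equal to:

\[
L = (2\pi)^{-(Tn)/2} \left( \prod_{t=1}^{T} \det \Delta_t \right)^{-1/2}
\exp\left[ -\tfrac{1}{2} \sum_{t=1}^{T} (Y_t - Y_{t|t-1})' \Delta_t^{-1} (Y_t - Y_{t|t-1}) \right].
\]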


Note  that  all  the  entities  necessary  to  evaluate  the  likelihood  function  are  provided 
by  the  Kalman  filter.  Thus,  the  evaluation  of  the  likelihood  function  is  a  byproduct 
of  the  Kalman  filter.  The  maximum  likelihood  estimator  (MLE)  is  then  given  by 
the  maximizer  of  the  likelihood  function,  or  more  conveniently  the  log-likelihood 
function.  Usually,  there  is  no  analytic  solution  available  so  that  one  must  resort 
to numerical methods. An estimate of the asymptotic covariance matrix can
be obtained from the inverse of the negative Hessian matrix of the log-likelihood
function evaluated at the optimum. Under the usual
assumptions, the MLE is consistent and delivers asymptotically normally distributed
estimates (Greene 2008; Amemiya 1994).
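
That the likelihood is a byproduct of the filter can be made concrete. The following Python sketch accumulates the Gaussian log-likelihood from the innovations and their variances (with $A = 0$) and evaluates it for the AR(1) process with measurement errors. The function names, the simulated data, and the crude grid search over $\phi$ (with both variances held fixed at assumed values) are illustrative assumptions. In practice one would hand the negative log-likelihood to a numerical optimizer.

```python
import numpy as np

def kalman_loglik(Y, F, G, Q, R, x0, P0):
    """Gaussian log-likelihood from the prediction error decomposition (A = 0)."""
    T, n = Y.shape
    x, P, loglik = x0, P0, 0.0
    for t in range(T):
        x_pred = F @ x
        P_pred = F @ P @ F.T + Q
        e = Y[t] - G @ x_pred                       # innovation Y_t - Y_{t|t-1}
        Delta = G @ P_pred @ G.T + R                # innovation variance
        _, logdet = np.linalg.slogdet(Delta)
        loglik += -0.5 * (n * np.log(2 * np.pi) + logdet
                          + e @ np.linalg.solve(Delta, e))
        gain = P_pred @ G.T @ np.linalg.inv(Delta)
        x = x_pred + gain @ e
        P = P_pred - gain @ G @ P_pred
    return loglik

def ar1_noise_loglik(phi, sig2_v, sig2_w, y):
    """Log-likelihood of the AR(1) model with measurement error, |phi| < 1."""
    F = np.array([[phi]])
    G = np.array([[1.0]])
    Q = np.array([[sig2_v]])
    R = np.array([[sig2_w]])
    P0 = np.array([[sig2_v / (1.0 - phi**2)]])      # stationary initialization
    return kalman_loglik(y.reshape(-1, 1), F, G, Q, R, np.zeros(1), P0)

# Simulated data under assumed parameter values
rng = np.random.default_rng(0)
T, phi_true = 200, 0.8
x = np.zeros(T)
x[0] = rng.normal(scale=np.sqrt(1.0 / (1.0 - phi_true**2)))
for t in range(1, T):
    x[t] = phi_true * x[t - 1] + rng.normal()
y = x + rng.normal(scale=np.sqrt(0.5), size=T)

# Crude grid search over phi with the variances held at their assumed values
grid = np.linspace(-0.95, 0.95, 39)
phi_hat = grid[np.argmax([ar1_noise_loglik(p, 1.0, 0.5, y) for p in grid])]
```

Working with the log of the determinant (`slogdet`) instead of the determinant itself avoids numerical underflow when many small factors are multiplied.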

The direct maximization of the likelihood function is often not easy in practice,
especially for large systems involving many parameters. The expectation-maximization
algorithm, EM algorithm for short, represents a valid, though slower
alternative. As the name indicates, it consists of two steps which have to be
carried out iteratively. Based on some starting values for the parameters, the first
step (expectation step) computes estimates, $X_{t|T}$, of the unobserved state vector $X_t$
using the Kalman smoother. In the second step (maximization step), the likelihood
function is maximized taking the estimates of $X_t$, $X_{t|T}$, as additional observations.
Treating $X_{t|T}$ as additional observations reduces the maximization
step to a simple multivariate regression. Indeed, by treating $X_{t|T}$ as if they were
known, the state equation becomes a simple VAR(1) which can be readily estimated
by linear least-squares to obtain the parameters $F$ and $Q$. The parameters $A$, $G$ and
$R$ are also easily retrieved from a regression of $Y_t$ on $X_{t|T}$. Based on these new
parameter estimates, we go back to step one and derive new estimates for $X_{t|T}$
which are then used in the maximization step. One can show that this procedure
maximizes the original likelihood function (see Dempster et al. 1977; Wu 1983). A
more detailed analysis of the EM algorithm in the time series context is provided by
Brockwell and Davis (1996).11

Sometimes  it  is  of  interest  not  only  to  compute  parameter  estimates  and  to  derive 
from  them  estimates  for  the  state  vector  via  the  Kalman  filter  or  smoother,  but  also 
to  find  confidence  intervals  for  the  estimated  state  vector  to  take  the  uncertainty 
into  account.  If  the  parameters  are  known,  the  methods  outlined  previously  showed 
how  to  obtain  these  confidence  intervals.  If,  however,  the  parameters  have  to  be 
estimated,  there  is  a  double  uncertainty:  the  uncertainty  from  the  filter  and  the 
uncertainty  arising  from  the  parameter  estimates.  One  way  to  account  for  this 
additional  uncertainty  is  by  the  use  of  simulations.  Thereby,  we  draw  a  given  number 
of  parameter  vectors  from  the  asymptotic  distribution  and  compute  for  each  of 
these  draws  the  corresponding  estimates  for  the  state  vector.  The  variation  in  these 
estimates  is  then  a  measure  of  the  uncertainty  arising  from  the  estimation  of  the 
parameters (see Hamilton 1994b; Section 13.7).


11 The analogue to the EM algorithm in the Bayesian context is given by the Gibbs sampler. In
contrast to the EM algorithm, we compute in the first step not the expected value of the states, but
we draw a state vector from the distribution of state vectors given the parameters. In the second
step, we do not maximize the likelihood function, but draw a parameter from the distribution of
parameters given the state vector drawn previously. Going back and forth between these two steps,
we get a Markov chain in the parameters and the states whose stationary distribution is exactly the
distribution of parameters and states given the data. A detailed description of Bayesian methods
and the Gibbs sampler can be found in Geweke (2005). Kim and Nelson (1999) discuss this method
in the context of state space models.

17.3.2 Identification

As emphasized in Remark 17.5 of Sect. 17.1, the state space representations are
not unique. See, for example, the two alternative representations of the ARMA(1,1)
model in Sect. 17.1. This non-uniqueness of state space models poses an identification
problem because different specifications may give rise to observationally
equivalent models.12 This problem is especially serious if all states are unobservable.
In practice, the identification problem gives rise to difficulties in the
numerical maximization of the likelihood function. For example, one may obtain
large differences for small variations in the starting values; or one may encounter
difficulties in the inversion of the matrix of second derivatives.

The  identification  of  state  space  models  can  be  checked  by  transforming  them 
into  VARMA  models  and  by  investigating  the  issue  in  this  reparameterized  setting 
(Hannan  and  Deistler  1988).  Exercise  17.5.6  invites  the  reader  to  apply  this  method 
to  the  AR(1)  model  with  measurement  errors.  System  identification  is  a  special  field 
in  systems  theory  and  will  not  be  pursued  further  here.  A  systematic  treatment  can 
be  found  in  the  textbook  by  Ljung  (1999). 


17.4  Examples 

17.4.1 Disaggregating Yearly Data into Quarterly Ones

The  official  data  for  quarterly  GDP  are  released  in  Switzerland  by  the  State 
Secretariat  for  Economic  Affairs  (SECO).  They  estimate  these  data  taking  the  yearly 
values  provided  by  the  Federal  Statistical  Office  (FSO)  as  given.  This  division  of 
tasks  is  not  uncommon  in  many  countries.  One  of  the  most  popular  methods  for 
disaggregation  of  yearly  data  into  quarterly  ones  was  proposed  by  Chow  and  Lin 
(1971). 13  It  is  a  regression  based  method  which  can  take  additional  information 
in  the  form  of  indicator  variables  (i.e.  variables  which  are  measured  at  the  higher 
frequency  and  correlated  at  the  lower  frequency  with  the  variable  of  interest)  into 
account.  This  procedure  is,  however,  rather  rigid.  The  state  space  framework  is  much 
more  flexible  and  ideally  suited  to  deal  with  missing  observations.  Applications  of 
this  framework  to  the  problem  of  disaggregation  were  provided  by  Bernanke  et  al. 
(1997: 1)  and  Cuche  and  Hess  (2000),  among  others.  We  will  illustrate  this  approach 
below. 

12 Remember that, in our context, two representations are equivalent if they generate the same mean
and covariance function for $\{Y_t\}$.

13 Similarly, one may envisage the disaggregation of yearly data into monthly ones or other forms
of disaggregation.

The analysis starts from the yearly growth rates of GDP and from indicator
variables which are recorded at the quarterly frequency and which are correlated
with GDP growth. In our application, we will consider the growth of industrial
production (IP) and the index on consumer sentiment (C) as indicators. Both variables
are available on a quarterly basis from 1990 onward. For simplicity, we assume that
the annualized quarterly growth rate of GDP, $\{Q_t\}$, follows an AR(1) process with
mean $\mu$:

\[
Q_t - \mu = \phi (Q_{t-1} - \mu) + w_t, \qquad w_t \sim \text{WN}(0, \sigma_w^2)
\]

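In addition, we assume that GDP is related to industrial production and consumer
sentiment by the following two equations:

\[
\begin{aligned}
IP_t &= \alpha_{IP} + \beta_{IP} Q_t + v_{IP,t} \\
C_t &= \alpha_C + \beta_C Q_t + v_{C,t}
\end{aligned}
\]

where the residuals $v_{IP,t}$ and $v_{C,t}$ are uncorrelated. Finally, we define the relation
between quarterly and yearly GDP growth as:

\[
J_t = \tfrac{1}{4} Q_t + \tfrac{1}{4} Q_{t-1} + \tfrac{1}{4} Q_{t-2} + \tfrac{1}{4} Q_{t-3}, \qquad t = 4, 8, 12, \ldots
\]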

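We can now bring these equations into state space form. Thereby the observation
equation is given by

\[
Y_t = A_t + G_t X_t + W_t
\]

with observation and state vectors

\[
Y_t = \begin{cases}
(J_t, IP_t, C_t)', & t = 4, 8, 12, \ldots; \\
(0, IP_t, C_t)', & t \neq 4, 8, 12, \ldots
\end{cases}
\qquad
X_t = \begin{pmatrix} Q_t - \mu \\ Q_{t-1} - \mu \\ Q_{t-2} - \mu \\ Q_{t-3} - \mu \end{pmatrix}
\]

and time-varying coefficient matrices

\[
A_t = \begin{cases}
(\mu,\ \alpha_{IP} + \beta_{IP}\mu,\ \alpha_C + \beta_C\mu)', & t = 4, 8, 12, \ldots; \\
(0,\ \alpha_{IP} + \beta_{IP}\mu,\ \alpha_C + \beta_C\mu)', & t \neq 4, 8, 12, \ldots
\end{cases}
\]

\[
G_t = \begin{cases}
\begin{pmatrix} \tfrac14 & \tfrac14 & \tfrac14 & \tfrac14 \\ \beta_{IP} & 0 & 0 & 0 \\ \beta_C & 0 & 0 & 0 \end{pmatrix}, & t = 4, 8, 12, \ldots; \\[2ex]
\begin{pmatrix} 0 & 0 & 0 & 0 \\ \beta_{IP} & 0 & 0 & 0 \\ \beta_C & 0 & 0 & 0 \end{pmatrix}, & t \neq 4, 8, 12, \ldots
\end{cases}
\]

\[
R_t = \begin{cases}
\begin{pmatrix} 0 & 0 & 0 \\ 0 & \sigma_{IP}^2 & 0 \\ 0 & 0 & \sigma_C^2 \end{pmatrix}, & t = 4, 8, 12, \ldots; \\[2ex]
\begin{pmatrix} 1 & 0 & 0 \\ 0 & \sigma_{IP}^2 & 0 \\ 0 & 0 & \sigma_C^2 \end{pmatrix}, & t \neq 4, 8, 12, \ldots
\end{cases}
\]

The intercepts of the two indicator equations appear as $\alpha_{IP} + \beta_{IP}\mu$ and
$\alpha_C + \beta_C\mu$ because the state vector contains the deviations $Q_t - \mu$ rather
than $Q_t$ itself.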

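The state equation becomes:

\[
X_{t+1} = F X_t + V_{t+1}
\]

where

\[
F = \begin{pmatrix} \phi & 0 & 0 & 0 \\ 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \\ 0 & 0 & 1 & 0 \end{pmatrix},
\qquad
Q = \begin{pmatrix} \sigma_w^2 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 \end{pmatrix}.
\]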
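
The time-varying system can be assembled mechanically. The following Python sketch builds the observation-equation matrices for a given quarter together with $F$ and $Q$. All function names and numerical values are hypothetical. The intercepts of the indicator equations are written as $\alpha + \beta\mu$, an assumption that follows from defining the state vector in deviations from $\mu$.

```python
import numpy as np

def system_matrices(t, mu, a_ip, b_ip, a_c, b_c, s2_ip, s2_c):
    """Observation-equation matrices A_t, G_t, R_t for quarter t = 1, 2, ..."""
    year_end = (t % 4 == 0)            # annual GDP growth is observed only in Q4
    # The state holds deviations Q_t - mu, hence intercepts a + b * mu (assumption)
    A = np.array([mu if year_end else 0.0, a_ip + b_ip * mu, a_c + b_c * mu])
    G = np.zeros((3, 4))
    if year_end:
        G[0, :] = 0.25                 # annual growth = average of four quarters
    G[1, 0], G[2, 0] = b_ip, b_c
    R = np.diag([0.0 if year_end else 1.0, s2_ip, s2_c])
    return A, G, R

def state_matrices(phi, s2_w):
    """Transition matrix F and innovation variance Q of the state equation."""
    F = np.zeros((4, 4))
    F[0, 0] = phi
    F[1, 0] = F[2, 1] = F[3, 2] = 1.0  # shift the lagged deviations down one slot
    Q = np.zeros((4, 4))
    Q[0, 0] = s2_w
    return F, Q

def observation_vector(t, J_t, ip_t, c_t):
    """Y_t with the unobserved annual figure replaced by 0 outside Q4."""
    return np.array([J_t if t % 4 == 0 else 0.0, ip_t, c_t])

# Hypothetical parameter values
params = dict(mu=2.0, a_ip=0.1, b_ip=1.5, a_c=0.2, b_c=0.7, s2_ip=0.3, s2_c=0.4)
A4, G4, R4 = system_matrices(4, **params)   # a fourth quarter
A5, G5, R5 = system_matrices(5, **params)   # a first quarter
F, Q = state_matrices(phi=0.5, s2_w=1.0)
```

In quarters without an annual figure the first row of $G_t$ is zero and the first observation is a constant with unit variance, so that row carries no information about the state and contributes only a constant to the likelihood.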

On my homepage http://www.neusser.ch/ you will find a MATLAB code which
maximizes the corresponding likelihood function numerically. Figure 17.3 plots the
different estimates of GDP growth and compares them with the data released by
the State Secretariat for Economic Affairs (SECO).

Fig. 17.3 Estimates of quarterly GDP growth rates for Switzerland


17.4.2 Structural Time Series Analysis

A customary practice in business cycle analysis is to decompose a time series into
several components. As an example, we estimate a structural time series model
which decomposes a time series additively into a local linear trend, a business
cycle component, a seasonal component, and an irregular component. This is the
specification studied as the basic structural model (BSM) in Sect. 17.1.1. We
carry over the specification explained there to apply it to quarterly real GDP of
Switzerland. Figure 17.4 shows the smoothed estimates of the various components.
In the upper left panel the demeaned logged original series (see Fig. 17.4a) is
plotted. One can clearly discern the trend and the seasonal variations. The upper right
panel shows the local linear trend (LLT). As one can see, the trend is not a
straight line, but exhibits pronounced waves of low frequency. The business cycle
component shown in Fig. 17.4c is much more volatile. The large drop of about
2.5 % in 2008/09 corresponds to the financial crisis. The lower right panel
plots the seasonal component (see Fig. 17.4d). From a visual inspection, one
can infer that the volatility of the seasonal component is much larger than that of the
cyclical component (compare the scale of the two components) so that movements in
GDP are dominated by seasonal fluctuations.14 Moreover, the seasonal component
changes its character over time.


14 The irregular component which is not shown has only very small variance.


Fig. 17.4 Components of the basic structural model (BSM) for real GDP of Switzerland. (a)
Logged Swiss GDP (demeaned). (b) Local linear trend (LLT). (c) Business cycle component. (d)
Seasonal component


17.5  Exercises 

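Exercise 17.5.1. Consider the basic structural time series model for $\{Y_t\}$:

\[
\begin{aligned}
Y_t &= T_t + W_t, & W_t &\sim \text{WN}(0, \sigma_W^2) \\
T_t &= \delta_{t-1} + T_{t-1} + \varepsilon_t, & \varepsilon_t &\sim \text{WN}(0, \sigma_\varepsilon^2) \\
\delta_t &= \delta_{t-1} + \xi_t, & \xi_t &\sim \text{WN}(0, \sigma_\xi^2)
\end{aligned}
\]

where the error terms $W_t$, $\varepsilon_t$ and $\xi_t$ are all uncorrelated with each other at all leads
and lags.

(i) Show that $\{Y_t\}$ follows an ARIMA(0,2,2) process.

(ii) Compute the autocorrelation function of $\{\Delta^2 Y_t\}$.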

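Exercise 17.5.2. If the cyclical component of the basic structural model for $\{Y_t\}$ is:

\[
\begin{pmatrix} C_t \\ C_t^* \end{pmatrix}
= \rho \begin{pmatrix} \cos \lambda_C & \sin \lambda_C \\ -\sin \lambda_C & \cos \lambda_C \end{pmatrix}
\begin{pmatrix} C_{t-1} \\ C_{t-1}^* \end{pmatrix}
+ \begin{pmatrix} V_{1,t}^{(C)} \\ V_{2,t}^{(C)} \end{pmatrix}
\]

where $\{V_{1,t}^{(C)}\}$ and $\{V_{2,t}^{(C)}\}$ are mutually uncorrelated white-noise processes.

(i) Show that $\{C_t\}$ follows an ARMA(2,1) process with ACF given by
$\rho_C(h) = \rho^h \cos \lambda_C h$.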

Exercise 17.5.3. Write the ARMA(p,q) process $Y_t = \phi_1 Y_{t-1} + \ldots + \phi_p Y_{t-p} + Z_t +
\theta_1 Z_{t-1} + \ldots + \theta_q Z_{t-q}$ as a state space model such that the state vector $X_t$ is given by:

\[
X_t = \begin{pmatrix} Y_t \\ Y_{t-1} \\ \vdots \\ Z_{t-q} \end{pmatrix}
\]


Exercise 17.5.4. Show that $X_t$ and $Y_t$ have a unique stationary and causal solution
if all eigenvalues of $F$ are strictly smaller than one in absolute value. Use the results from
Sect. 12.3.

Exercise 17.5.5. Find the Kalman filter equations for the following system:

\[
\begin{aligned}
X_t &= \phi X_{t-1} + w_t \\
Y_t &= \lambda X_t + v_t
\end{aligned}
\]

where $\lambda$ and $\phi$ are scalars and where

Exercise 17.5.6. Consider the state space model of an AR(1) process with measurement
error analyzed in Sect. 17.2:

\[
\begin{aligned}
X_{t+1} &= \phi X_t + v_{t+1}, & v_t &\sim \text{IIDN}(0, \sigma_v^2) \\
Y_t &= X_t + w_t, & w_t &\sim \text{IIDN}(0, \sigma_w^2).
\end{aligned}
\]

For simplicity assume that $|\phi| < 1$.

(i) Show that $\{Y_t\}$ is an ARMA(1,1) process given by $Y_t - \phi Y_{t-1} = Z_t + \theta Z_{t-1}$ with
$Z_t \sim \text{WN}(0, \sigma_Z^2)$.

(ii) Show that the parameters of the state space model and those of the
ARMA(1,1) model are related by the equations

\[
\theta \sigma_Z^2 = -\phi \sigma_w^2
\]

\[
\frac{\theta}{1 + \theta^2} = \frac{-\phi \sigma_w^2}{\sigma_v^2 + (1 + \phi^2)\sigma_w^2}
\]

(iii) Why is there an identification problem?


18 Generalizations of Linear Time Series Models


Autoregressive  moving-average  models  have  become  the  predominant  approach  in 
the  analysis  of  economic,  especially  macroeconomic  time  series.  The  success  of 
these  parametric  models  is  due  to  a  mature  and  by  now  well-understood  statistical 
theory  which  has  been  the  subject  of  this  book.  The  main  assumption  behind  this 
theory  is  its  linear  structure.  Although  convenient,  the  assumption  of  a  constant 
linear  structure  turned  out  to  be  unrealistic  in  many  empirical  applications.  The 
evolution  of  economies  and  the  economic  dynamics  are  often  not  fully  captured  by 
constant  coefficient  linear  models.  Many  time  series  are  subject  to  structural  breaks 
which  manifest  themselves  as  a  sudden  change  in  the  model  coefficients  by  going 
from  one  period  to  another.  The  detection  and  dating  of  such  structural  breaks  is 
the  subject  of  Sect.  18.1.  Alternatively,  one  may  think  of  the  model  coefficients  as 
varying  over  time.  Such  models  have  proven  to  be  very  flexible  and  able  to  generate 
a  variety  of  non-linear  features.  We  present  in  Sects.  18.2  and  18.3  two  variants  of 
such  models.  In  the  first  one,  the  model  parameters  vary  in  a  systematic  way  with 
time.  They  are,  for  example,  following  an  autoregressive  process.  In  the  second 
one,  the  parameters  switch  between  a  finite  number  of  states  according  to  a  hidden 
Markov  chain.  These  states  are  often  identified  as  regimes  which  have  a  particular 
economic  meaning,  for  example  as  booms  and  recessions.  Further  parametric  and 
nonparametric  methods  for  modeling  and  analyzing  nonlinear  time  series  can  be 
found  in  Fan  and  Yao  (2003). 


18.1  Structural  Breaks 

There is an extensive literature dealing with the detection and dating of structural
breaks in the context of time series. This literature is comprehensively summarized
in Perron (2006), among others. A compact account can also be found in Aue
and Horvath (2011) where additional testing procedures, like the CUSUM test, are
presented. In this short exposition we follow Bai et al. (1998) and focus on Chow
type test procedures. For the technical details the interested reader is referred to
these papers.


18.1.1  Methodology 

Consider, for the ease of exposition, a VAR(1) process which allows for a structural
break at some known date $t_b$:

\[
X_t = d_t(t_b) \left( c^{(1)} + \Phi^{(1)} X_{t-1} \right) + (1 - d_t(t_b)) \left( c^{(2)} + \Phi^{(2)} X_{t-1} \right) + Z_t \tag{18.1}
\]

where

\[
d_t(t_b) = \begin{cases} 1, & t \le t_b; \\ 0, & t > t_b. \end{cases}
\]

Thus, before time $t_b$ the coefficients of the VAR process are given by $c^{(1)}$ and $\Phi^{(1)}$
whereas after $t_b$ they are given by $c^{(2)}$ and $\Phi^{(2)}$. The error process $\{Z_t\}$ is assumed to
be $\text{IID}(0, \Sigma)$ with $\Sigma$ positive definite.1 Suppose further that the roots of $\Phi^{(1)}(z)$ as
well as those of $\Phi^{(2)}(z)$ are outside the unit circle. The process therefore is stationary
and admits a causal representation with respect to $\{Z_t\}$ before and after date $t_b$.

The assumption of a structural break at some known date $t_b$ can then be
investigated by testing the hypothesis

\[
H_0 : c^{(1)} = c^{(2)} \text{ and } \Phi^{(1)} = \Phi^{(2)} \quad \text{against} \quad H_1 : c^{(1)} \neq c^{(2)} \text{ or } \Phi^{(1)} \neq \Phi^{(2)}.
\]

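The standard way to test such a hypothesis is via the F-statistic. Given a sample
ranging from period 0 to period $T$, the strategy is to partition all variables
and matrices along the break date $t_b$. Following the notation and the spirit of
Sect. 13.2, define $Y = \operatorname{vec}(Y^{(1)}, Y^{(2)})$ where $Y^{(1)} = (X_1, X_2, \ldots, X_{t_b})$ and $Y^{(2)} =
(X_{t_b+1}, X_{t_b+2}, \ldots, X_T)$, $Z = (Z_1, Z_2, \ldots, Z_T)$, and

\[
X^{(1)} = \begin{pmatrix}
1 & X_{1,0} & \cdots & X_{n,0} \\
1 & X_{1,1} & \cdots & X_{n,1} \\
\vdots & \vdots & & \vdots \\
1 & X_{1,t_b-1} & \cdots & X_{n,t_b-1}
\end{pmatrix},
\qquad
X^{(2)} = \begin{pmatrix}
1 & X_{1,t_b} & \cdots & X_{n,t_b} \\
1 & X_{1,t_b+1} & \cdots & X_{n,t_b+1} \\
\vdots & \vdots & & \vdots \\
1 & X_{1,T-1} & \cdots & X_{n,T-1}
\end{pmatrix}
\]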

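then the model (18.1) can be written as

\[
Y = \operatorname{vec}(Y^{(1)}, Y^{(2)}) =
\underbrace{\begin{pmatrix} X^{(1)} \otimes I_n & 0 \\ 0 & X^{(2)} \otimes I_n \end{pmatrix}}_{X}
\operatorname{vec}\left(c^{(1)}, \Phi^{(1)}, c^{(2)}, \Phi^{(2)}\right) + \operatorname{vec} Z.
\]

The least-squares estimator becomes

\[
\begin{aligned}
\hat\beta &= \operatorname{vec}\left(\hat c^{(1)}, \hat\Phi^{(1)}, \hat c^{(2)}, \hat\Phi^{(2)}\right) = (X'X)^{-1} X' Y \\
&= \begin{pmatrix}
\left( (X^{(1)\prime} X^{(1)})^{-1} X^{(1)\prime} \right) \otimes I_n & 0 \\
0 & \left( (X^{(2)\prime} X^{(2)})^{-1} X^{(2)\prime} \right) \otimes I_n
\end{pmatrix}
\operatorname{vec}(Y^{(1)}, Y^{(2)}).
\end{aligned}
\]

1 Generalization to higher order VAR models is straightforward. For changes in the covariance
matrix $\Sigma$ see Bai (2000). For the technical details the reader is referred to the relevant literature.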

This amounts to estimating the model separately over the two sample periods. Note
that as in Sect. 13.2 the GLS estimator is numerically identical to the OLS estimator
because the same regressors are used for each equation. The corresponding Wald
test can be implemented by defining $R = (I_{n^2+n}, -I_{n^2+n})$ and computing the
F-statistic $F(t_b)$, in which the covariance matrix $\Sigma$ is estimated from the least-squares
residuals $\hat Z_t$ as $\hat\Sigma_{t_b} = \frac{1}{T} \sum_{t=1}^{T} \hat Z_t \hat Z_t'$ given break date $t_b$. Under the standard
assumptions made in Sect. 13.2, the test statistic $F(t_b)/(n^2 + n)$ converges for
$T \to \infty$ to a chi-square distribution with $n^2 + n$ degrees of freedom.2 This test
is known in the literature as the Chow test.

The  previous  analysis  assumed  that  the  potential  break  date  tb  is  known.  This 
assumption  often  turns  out  to  be  unrealistic  in  practice.  The  question  then  arises  how 
to  determine  a  potential  break  date.  Quandt  (1960)  proposed  a  simple  procedure: 
compute  the  Chow-test  for  all  possible  break  dates  and  take  as  a  candidate  break 
date  the  date  where  the  F-statistic  reaches  its  maximal  value.  Despite  its  simplicity, 
Quandt’s  procedure  could  not  be  implemented  coherently  because  it  was  not  clear 
which  distribution  to  use  for  the  construction  of  the  critical  values.  This  problem 
remained open for more than thirty years until the contribution of Andrews (1993).3
Denote by $\lfloor x \rfloor$ the value of $x$ rounded to the nearest integer towards minus infinity;
then the maximum Wald statistic and the logarithm of the Andrews and Ploberger
(1994) exponential Wald statistic can be written as follows:


2 As the asymptotic theory requires that $t_b/T$ does not go to zero, one has to assume that both the
number of periods before and after the break go to infinity.

3 A textbook version of the test can be found in Stock and Watson (2011).

$$\sup F: \quad \sup_{\tau \in (\tau^*,\, 1-\tau^*)} F(\lfloor \tau T \rfloor)$$

$$\exp F: \quad \log \int_{\tau^*}^{1-\tau^*} \exp\left(\tfrac{1}{2} F(\lfloor \tau T \rfloor)\right) d\tau$$


where $\tau^*$ denotes the percentage of the sample which is trimmed. Usually, $\tau^*$ takes
the value of 0.15 or 0.10. Critical values for low degrees of freedom are tabulated in
Andrews (1993, 2003) and Stock and Watson (2011). It is possible to construct an
asymptotic confidence interval for the break date. The corresponding formulas can
be found in Bai et al. (1998; p. 401-402).
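Given a sequence of F-statistics, one per candidate break date, both statistics reduce to a few lines. The sketch below assumes the statistics are stored in a dictionary keyed by break date (a hypothetical layout) and approximates the integral by a Riemann sum with step $1/T$; the function name `sup_exp_f` is likewise an assumption.

```python
import numpy as np

def sup_exp_f(F_by_date, T, trim=0.15):
    """supF, expF and the maximizing date from F(t), t = candidate break dates.

    F_by_date : mapping from integer break date t to the statistic F(t).
    trim      : trimming fraction tau*, applied at both ends of the sample.
    """
    eps = 1e-9                                     # guards against floating-point floor errors
    lo = int(np.floor(trim * T + eps))
    hi = int(np.floor((1.0 - trim) * T + eps))
    dates = np.arange(lo + 1, hi + 1)              # dates strictly inside the trimmed range
    F = np.array([F_by_date[int(t)] for t in dates], dtype=float)
    sup_f = float(F.max())
    # log of (1/T) * sum exp(F/2), computed with a shift for numerical stability
    exp_f = float(0.5 * sup_f + np.log(np.sum(np.exp(0.5 * (F - sup_f))) / T))
    return sup_f, exp_f, int(dates[np.argmax(F)])
```

The returned date is Quandt's candidate break date. Critical values still have to be taken from the tables cited above, since neither statistic has a standard distribution.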


18.1.2  An  Example 

The  use  of  the  structural  break  test  is  demonstrated  using  historical  data  for  the 
United  Kingdom.  The  data  consist  of  logged  per  capita  real  GDP,  logged  per 
capita  real  government  expenditures,  logged  per  capita  real  government  revenues, 
the  inflation  based  on  the  consumer  price  index,  and  a  long-term  interest  rate 
over  a  sample  period  from  1830  to  2003.  The  basis  for  the  analysis  consists  of 
a  five  variable  VAR(2)  model  including  a  constant  term  and  a  linear  trend.  Three 
alternative  structural  break  modes  are  investigated:  break  in  the  intercept,  break 
in  the  intercept  and  the  time  trend,  and  break  in  all  coefficients,  including  the 
VAR  coefficients.  The  corresponding  F-statistics  are  plotted  in  Fig.  18.1  against 
all  possible  break  dates  allowing  for  a  trimming  value  of  10  %.  The  horizontal  lines 
show  for  all  three  alternative  break  modes  the  corresponding  critical  values  for  the 
supF  test  given  5  %  significance  levels.  These  critical  values  have  been  obtained 
from  Monte  Carlo  simulations  as  in  Andrews  (1993,  2003)  and  are  given  as  18.87, 
28.09,  and  97.39.4 

Figure  18.1  shows  that  for  all  three  modes  a  significant  structural  break  occurs. 
The  corresponding  values  of  the  supF  statistics  are  78.06,  104.75,  and  285.22.  If 
only  the  deterministic  parts  are  allowed  to  change,  the  break  date  is  located  in  1913. 
If  all  coefficients  are  allowed  to  change,  the  break  is  dated  in  1968.  However,  all 
three F-statistics show a steep increase in 1913. Thus, if only one break is allowed,
1913 seems to be the most likely one.5 The breaks are quite precisely dated. The
corresponding  standard  errors  are  estimated  to  be  two  years  for  the  break  in  the 
intercept  only  and  one  year  for  the  other  two  break  modes. 


4 Assuming a trimming value of 0.10, Andrews (2003; table I) reports critical values of 18.86 for
p = 5, which corresponds to changes in the intercept only, and 27.27 for p = 10, which corresponds
to changes in intercept and time trend.

5 See Perron (2006) for a discussion of multiple breaks.

Fig. 18.1 Analysis of break dates with the supF test statistic for historical UK time series (horizontal axis: time)


18.2  Time-Varying  Parameters 

This section discusses time-varying coefficient vector autoregressive models (TVC-VAR
models). This model class retains the flavor of VAR models but assumes that
they are only valid locally. Consider for this purpose a VAR(1) model with
time-varying autoregressive coefficients $\Phi_t$:

$$X_{t+1} = \Phi_t X_t + Z_{t+1}, \qquad Z_t \sim \mathrm{IID}(0, \Sigma) \text{ with } \Sigma > 0, \quad t \in \mathbb{Z}. \tag{18.3}$$

This model can be easily generalized to higher order VARs (see below) or,
alternatively, one may think of Eq. (18.3) as a higher order VAR in companion
form. The autoregressive coefficient matrix is assumed to be stochastic. Thus, $\Phi_t$
is a random $n \times n$ matrix. Models of this type have been widely discussed in the
probabilistic literature because they arise in many diverse contexts. In economics,
Eq. (18.3) can be interpreted as the probabilistic version describing the value of a
perpetuity, i.e. the present discounted value of a permanent commitment to pay a
certain sum each period. Thereby $Z_t$ denotes the random periodic payments and $\Phi_t$
the random cumulative discount factors. The model also plays an important role in
the characterization of the properties of volatility models as we have seen in Sect. 8.1
(see in particular the proofs of Theorems 8.1 and 8.3). In this presentation, the above
model is interpreted as a locally valid VAR process.

A  natural  question  to  ask  is  under  which  conditions  Eq.  (18.3)  admits  a  stationary 
solution.  An  answer  to  this  question  can  be  found  by  iterating  the  equation 
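backwards in time:

$$\begin{aligned}
X_t &= \Phi_{t-1}X_{t-1} + Z_t = \Phi_{t-1}(\Phi_{t-2}X_{t-2} + Z_{t-1}) + Z_t \\
&= Z_t + \Phi_{t-1}Z_{t-1} + \Phi_{t-1}\Phi_{t-2}X_{t-2} \\
&= Z_t + \Phi_{t-1}Z_{t-1} + \Phi_{t-1}\Phi_{t-2}Z_{t-2} + \Phi_{t-1}\Phi_{t-2}\Phi_{t-3}X_{t-3} \\
&\;\;\vdots \\
&= \sum_{j=0}^{k}\left(\prod_{i=1}^{j}\Phi_{t-i}\right)Z_{t-j} + \left(\prod_{i=1}^{k+1}\Phi_{t-i}\right)X_{t-k-1}, \qquad k = 0, 1, 2, \ldots
\end{aligned}$$

where it is understood that $\prod_{i=1}^{0} = I_n$. This suggests as a solution candidate

$$X_t = \sum_{j=0}^{\infty}\left(\prod_{i=1}^{j}\Phi_{t-i}\right)Z_{t-j}
= Z_t + \Phi_{t-1}Z_{t-1} + \Phi_{t-1}\Phi_{t-2}Z_{t-2} + \Phi_{t-1}\Phi_{t-2}\Phi_{t-3}Z_{t-3} + \ldots \tag{18.4}$$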
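The backward iteration can be checked numerically. The following sketch simulates the recursion (18.3) from a starting value $X_0 = Z_0$ and compares $X_t$ with the expansion truncated at date 0, where it is exact. The coefficient process (a fixed matrix plus small noise) is a hypothetical choice made only for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
n, T = 2, 30
# Hypothetical coefficient process Phi_0, ..., Phi_{T-1}
Phi = 0.5 * np.eye(n) + 0.1 * rng.standard_normal((T, n, n))
Z = rng.standard_normal((T + 1, n))               # Z_0, ..., Z_T

# Forward recursion X_{t+1} = Phi_t X_t + Z_{t+1}, started at X_0 = Z_0
X = np.zeros((T + 1, n))
X[0] = Z[0]
for t in range(T):
    X[t + 1] = Phi[t] @ X[t] + Z[t + 1]

def expansion(t):
    """sum_{j=0}^{t} (prod_{i=1}^{j} Phi_{t-i}) Z_{t-j}, i.e. (18.4) truncated at date 0."""
    total, prod = np.zeros(n), np.eye(n)          # the empty product equals I_n
    for j in range(t + 1):
        total += prod @ Z[t - j]
        if j < t:
            prod = prod @ Phi[t - j - 1]          # multiply by Phi_{t-(j+1)} on the right
    return total

print(np.allclose(expansion(T), X[T]))
```

Note the order of multiplication: more recent coefficient matrices stand to the left, which matters because the $\Phi_t$ do not commute in general.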


Based  on  results  obtained  by  Brandt  (1986)  and  extended  by  Bougerol  and  Picard 
(1992b),  we  can  cite  the  following  theorem. 

Theorem 18.1 (Solution TVC-VAR(1)). Let $\{(\Phi_t, Z_t)\}$ be a strictly stationary
ergodic process such that

(i) $E(\log^+ \|\Phi_t\|) < \infty$ and $E(\log^+ \|Z_t\|) < \infty$ where $x^+$ denotes $\max\{x, 0\}$;

(ii) the top Lyapunov exponent $\gamma$ defined as

$$\gamma = \inf_{n \in \mathbb{N}} \left\{ E\left( \frac{1}{n+1} \log \|\Phi_0 \Phi_{-1} \cdots \Phi_{-n}\| \right) \right\}$$

is strictly negative.

Then $X_t$ as defined in Eq. (18.4) converges a.s. and $\{X_t\}$ is the unique strictly
stationary solution of equation (18.3).

Remark 18.1. The Lyapunov exponent measures the rate of separation of nearby
trajectories in a dynamic system. The top Lyapunov exponent gives the largest of
these rates. It is used to characterize the stability of a dynamic system (see Colonius
and Kliemann (2014)).

Remark  18.2.  Although  Theorem  18.1  states  only  sufficient  conditions,  these 
assumptions  can  hardly  be  relaxed. 
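In applications the top Lyapunov exponent is rarely available in closed form, but it can be approximated by simulation. The sketch below is one common device, not a method taken from the text: it follows the growth of a single vector under repeated multiplication by random draws and renormalizes at every step. It assumes IID draws and a generic starting vector, for which the vector growth rate coincides with the top exponent almost surely; `draw_phi` is a hypothetical user-supplied sampler.

```python
import numpy as np

def top_lyapunov(draw_phi, n, steps=20000, seed=0):
    """Monte Carlo estimate of the top Lyapunov exponent of a product of random matrices.

    draw_phi(rng) must return one n x n matrix per call (assumed IID draws here).
    """
    rng = np.random.default_rng(seed)
    v = np.ones(n) / np.sqrt(n)                   # generic unit starting vector
    log_growth = 0.0
    for _ in range(steps):
        v = draw_phi(rng) @ v
        norm = np.linalg.norm(v)
        log_growth += np.log(norm)
        v = v / norm                              # renormalize to avoid under- or overflow
    return log_growth / steps

# Example: a constant matrix 0.5 * I has exponent log(0.5) < 0
gamma_hat = top_lyapunov(lambda rng: 0.5 * np.eye(2), n=2, steps=100)
```

A negative estimate is consistent with condition (ii) of Theorem 18.1, although a simulation cannot prove it.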

The solution (18.4), if it exists, is similar to a causal representation. The matrix
sequence $\{\prod_{j=1}^{h}\Phi_{t+h-j}\}_{h=0,1,2,\ldots} = \{I_n, \Phi_t, \Phi_{t+1}\Phi_t, \Phi_{t+2}\Phi_{t+1}\Phi_t, \ldots\}$ represents
the effect of an impulse in period $t$ to $X_{t+h}$, $h = 0, 1, 2, \ldots$, and can therefore
be interpreted as impulse response functions. In contrast to the impulse response
functions studied so far, they are clearly random and time-dependent because the
effect of $Z_t$ depends on future coefficients. In particular, the effect of $Z_t$ on $X_{t+h}$,
$h \geq 1$, is not the same as the effect of $Z_{t-h}$ on $X_t$. Nevertheless it is possible
to construct meaningful impulse response functions by Monte Carlo simulations.
One may then report the mean of the impulse responses or some quantiles for
different time periods.6 Alternatively, one may ignore the randomness and
time-dependency and define “local” impulse responses as $\Phi_t^h$, $h = 0, 1, 2, \ldots$ Note,
however, that the impulse responses so defined still vary with time. Irrespective of
how the impulse responses are constructed, they can be interpreted in the same
way as in the case of constant coefficients. In particular, we may use some of the
identification schemes discussed in Chap. 15 and compute the impulse responses
with respect to structural shocks. Similar arguments apply to the forecast error
variance decomposition (FEVD).
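The two notions of impulse responses can be contrasted in a few lines. The sketch below assumes a given (simulated or estimated) path of coefficient matrices stored as a list; the function names are hypothetical.

```python
import numpy as np

def path_irf(Phi_path, t, H):
    """Responses I, Phi_t, Phi_{t+1} Phi_t, ... of X_{t+h} to an impulse in period t."""
    n = Phi_path[t].shape[0]
    prod, out = np.eye(n), [np.eye(n)]
    for h in range(1, H + 1):
        prod = Phi_path[t + h - 1] @ prod         # newest coefficient matrix enters on the left
        out.append(prod)
    return out

def local_irf(Phi_t, H):
    """'Local' responses Phi_t^h, which freeze the coefficients at their date-t value."""
    return [np.linalg.matrix_power(Phi_t, h) for h in range(H + 1)]
```

In a Monte Carlo exercise one would call `path_irf` on many simulated coefficient paths and report the mean or selected quantiles of the resulting matrices at each horizon.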

The model is closed by fixing the law of motion for $\Phi_t$. As already mentioned
in Sect. 17.1.1 there are several possibilities. In this presentation we adopt the
following flexible autoregressive specification:

$$\beta_{t+1} - \bar\beta = F(\beta_t - \bar\beta) + V_{t+1}, \qquad V_t \sim \mathrm{WN}(0, Q) \tag{18.5}$$

where $\beta_t = \operatorname{vec}\Phi_t$ denotes the $n^2$-vector of stacked coefficients. $Q$ is assumed to be
fixed and is, usually, specified as a diagonal matrix. If the eigenvalues of $F$ are inside
the unit circle, the autoregressive model is mean-reverting and $\bar\beta$ can be interpreted
as the average coefficient vector. The formulation in Eq. (18.5) is, however, not
restricted to this case and allows explicitly the possibility that $\{\beta_t\}$ follows a random
walk. This specification has become very popular in the empirical macroeconomic
literature and was initially adopted by Cogley and Sargent (2001) to analyze the
dynamics of inflation across different policy regimes.7

The model consisting of Eqs. (18.3) and (18.5) can be easily reformulated as a
state space model by defining $\xi_t = \beta_t - \bar\beta$ as the state vector. The state and the
measurement equation can then be written as:

state equation: $\xi_{t+1} = F\xi_t + V_{t+1}$ (18.6)

measurement equation: $X_t = (X'_{t-1} \otimes I_n)\bar\beta + (X'_{t-1} \otimes I_n)\xi_t + Z_t$. (18.7)

Conditional  on  initial  values  for  the  coefficients  and  their  covariances,  the  state 
space  model  can  be  estimated  by  maximum  likelihood  by  applying  the  Kalman 
filter  (see  Sect.  17.3  and  Kim  and  Nelson  (1999)).  One  possibility  to  initialize  the 
Kalman  filter  is  to  estimate  the  model  for  some  initial  sample  period  assuming  fixed 
coefficients  and  extract  from  these  estimates  the  corresponding  starting  values. 
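A minimal sketch of the filtering recursions for this state space form follows. It takes all system matrices as given and uses the standard Kalman prediction and update steps with Gaussian likelihood contributions; the function name, the argument layout, and the timing convention (the state is dated as in the measurement equation above) are assumptions of this illustration, and no attempt is made at numerical refinements such as square-root filtering.

```python
import numpy as np

def tvc_var1_filter(X, beta_bar, F, Q, Sigma, P0):
    """Kalman filter for xi_t = beta_t - beta_bar in a TVC-VAR(1).

    X is a (T+1, n) array with rows X_0, ..., X_T. Returns the filtered
    coefficient vectors beta_{t|t}, t = 1, ..., T, and the Gaussian log-likelihood.
    """
    T, n = X.shape[0] - 1, X.shape[1]
    xi, P = np.zeros(n * n), np.array(P0, dtype=float)   # initial state mean and covariance
    betas, loglik = [], 0.0
    for t in range(1, T + 1):
        H = np.kron(X[t - 1][None, :], np.eye(n))        # (X_{t-1}' kron I_n), n x n^2
        err = X[t] - H @ (beta_bar + xi)                 # one-step forecast error
        S = H @ P @ H.T + Sigma                          # forecast error covariance
        K = P @ H.T @ np.linalg.inv(S)                   # Kalman gain
        xi_f, P_f = xi + K @ err, P - K @ H @ P          # update step
        betas.append(beta_bar + xi_f)
        loglik += -0.5 * (n * np.log(2 * np.pi) + np.linalg.slogdet(S)[1]
                          + err @ np.linalg.solve(S, err))
        xi, P = F @ xi_f, F @ P_f @ F.T + Q              # prediction step
    return np.array(betas), loglik
```

With `Q = 0` and `F = I` the recursion collapses to recursive least squares, which is a convenient sanity check before letting the coefficients move.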


6Potter  (2000)  discusses  the  primal  problems  of  defining  impulse  responses  in  a  nonlinear  context. 

7 They allow for a correlation between $V_t$ and $Z_t$.

As it turns out, allowing time-variation only in the coefficients of the VAR model
overstates the role attributed to structural changes. We therefore generalize the
model to allow for time-varying volatility. More specifically, we also allow $\Sigma$ in
Eq. (18.3) to vary with time. The modeling of the time-variation in $\Sigma$ is, however,
not a straightforward task because we must ensure that in each period $\Sigma_t$ is a
symmetric positive definite matrix. One approach is to specify a process especially
designed for modeling the dynamics of covariance matrices. This so-called Wishart
autoregressive process was first introduced to economics by Gourieroux et al. (2009)
and successfully applied by Burren and Neusser (2013). It leads to a nonlinear state
space system which can be estimated with the particle filter, a generalization of the
Kalman filter.

Another, more popular approach was initiated by Cogley and Sargent (2005)
and Primiceri (2005). It is based on the Cholesky factorization of the time-varying
covariance matrix $\Sigma_t$. Using the same notation as in Sect. 15.3, $\Sigma_t$ is decomposed as

$$\Sigma_t = B_t \Omega_t B_t' \tag{18.8}$$

where $B_t$ is a time-varying lower triangular matrix with ones on the diagonal and $\Omega_t$
a time-varying diagonal matrix with strictly positive diagonal elements.8 The logged
diagonal elements of $\Omega_t$ are then assumed to evolve as independent univariate
random walks. This specification can be written in matrix terms as

$$\Omega_t = \Omega_{t-1} \exp(D_t) \tag{18.9}$$

where $D_t$ is a diagonal matrix with $\operatorname{diag}(D_t) \sim \mathrm{WN}(0, \Sigma_D)$. In the above
formulation exp denotes the matrix exponential.9 Taking the matrix logarithm, we
get exactly the formulation of Cogley and Sargent (2005) and Primiceri (2005). For
the time evolution of $B_t$ we propose a similar specification:

$$B_t = B_{t-1} \exp(C_t) \tag{18.10}$$

where $C_t$ is a strictly lower triangular matrix, i.e. $C_t$ is a lower triangular matrix with
zeros on the diagonal. The non-zero entries of $C_t$, denoted by $[C_t]_{i>j}$, are assumed to
follow a multivariate white noise process with diagonal covariance matrix $\Sigma_B$, i.e.
$[C_t]_{i>j} \sim \mathrm{WN}(0, \Sigma_B)$. It can be shown that the matrix exponential of a strictly lower
triangular matrix is a lower triangular matrix with ones on the diagonal. As the set of
lower triangular matrices with ones on the diagonal forms a group, called the unipotent
group and denoted by $SLT_n$, the above specification is well-defined. Moreover, this
formulation is a very natural one as the set of strictly lower triangular matrices


8It  is  possible  to  consider  other  short-run  type  identification  schemes  (see  Sect.  15.3)  than  the 
Cholesky  factorization. 

9 The matrix exponential of a matrix $A$ is defined as $\exp(A) = \sum_{i=0}^{\infty} \frac{A^i}{i!}$ where $A$ is any matrix.
The inverse operation, the matrix logarithm, is given for $\|A\| < 1$ by the series $\log(I_n + A) = \sum_{i=1}^{\infty} \frac{(-1)^{i-1}}{i} A^i$.

is the tangent space of $SLT_n$ at the identity (see Baker 2002; for details). Thus,
Eq. (18.10) can be interpreted as a log-linearized version of $B_t$. The technique
proposed for the evolution of $B_t$ in Eq. (18.10) departs from Primiceri (2005) who
models each element of the inverse of $B_t$ and therefore misses a coherent system
theoretic approach. See Neusser (2016) for details.
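The group property behind Eq. (18.10) is easy to verify numerically. The sketch below uses the fact that a strictly lower triangular $n \times n$ matrix is nilpotent, so the exponential series stops after $n - 1$ terms and needs no special library routine. The shock standard deviations (0.1) and the dimensions are hypothetical choices.

```python
import numpy as np

def expm_strictly_lower(C):
    """exp(C) for strictly lower triangular C; the series is finite because C^n = 0."""
    n = C.shape[0]
    out, term = np.eye(n), np.eye(n)
    for i in range(1, n):
        term = term @ C / i                       # accumulates C^i / i!
        out = out + term
    return out

rng = np.random.default_rng(2)
n, T = 3, 50
B, log_omega = np.eye(n), np.zeros(n)
for _ in range(T):
    C = np.tril(0.1 * rng.standard_normal((n, n)), k=-1)   # strictly lower triangular shock
    B = B @ expm_strictly_lower(C)                          # B_t = B_{t-1} exp(C_t)
    log_omega = log_omega + 0.1 * rng.standard_normal(n)    # random walk in the log variances
Sigma_T = B @ np.diag(np.exp(log_omega)) @ B.T              # Sigma_t = B_t Omega_t B_t'
```

After any number of steps `B` remains lower triangular with unit diagonal, and `Sigma_T` is symmetric positive definite by construction, which is precisely what the specification is designed to guarantee.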

Although  this  TVC-VAR  model  with  time-varying  volatility  can  in  principle  also 
be  estimated  by  maximum  likelihood,  this  technique  can  hardly  be  implemented 
successfully  in  practice.  The  main  reason  is  that  the  likelihood  function  of  such  a 
model, even when the dimension and the order of the VAR are low, is a very high
dimensional  nonlinear  object  with  probably  many  local  maxima.  Moreover,  as  the 
variances  governing  the  time-variation  are  small,  at  least  for  some  of  the  coefficients, 
the  likelihood  function  is  flat  in  some  regions  of  the  parameter  space.  These  features 
make  maximization  of  the  likelihood  function  a  very  difficult,  if  not  impossible, 
task  in  practice.  For  these  reasons,  Bayesian  techniques  have  been  used  almost 
exclusively. There is, however, also a conceptual issue involved. As the Bayesian
approach  does  not  strictly  distinguish  between  fixed  “true”  parameters  and  random 
samples,  it  is  better  suited  to  handle  TVC-VAR  models  which  treat  the  parameters 
as  random.  In  this  monograph,  we  will  not  tackle  the  Bayesian  approach  but  refer 
to  the  relevant  literature.  See  for  example  Primiceri  (2005),  Negro  and  Primiceri 
(2015),  Cogley  and  Sargent  (2005),  Canova  (2007)  and  Koop  and  Korobilis  (2009) 
among  others. 


The  Minnesota  Prior 

Although we will not discuss the Bayesian approach to VAR modeling, it is
nevertheless instructive to portray the so-called Minnesota prior applied by Doan
et al. (1984) to TVC-VAR models. This prior has gained some reputation in
connection with forecasting with VAR models and as a way to specify the initial
distribution for the Kalman filter in time-varying models. The combination of the
prior  distribution  with  the  likelihood  function  delivers  via  Bayes’  rule  a  posterior 
distribution  of  the  parameters  which  can  then  be  analyzed  using  simulation  methods. 

The  Minnesota  prior  is  based  on  the  a  priori  belief  that  each  variable  follows  a 
random  walk  with  no  interaction  among  the  variables  nor  among  the  coefficients  of 
the VAR equations. We present one version of the Minnesota prior in the general
context of a TVC-VAR model of order $p$ with time-varying constant term $c_t$:

$$X_t = c_t + \Phi^{(1)}_{t-1} X_{t-1} + \cdots + \Phi^{(p)}_{t-1} X_{t-p} + Z_t. \tag{18.11}$$

This model can be written compactly as

$$X_t = (\mathbf{X}'_{t-1} \otimes I_n) \operatorname{vec}\Phi_{t-1} + Z_t = (\mathbf{X}'_{t-1} \otimes I_n)\beta_{t-1} + Z_t \tag{18.12}$$

where $\mathbf{X}_{t-1} = (1, X'_{t-1}, \ldots, X'_{t-p})'$, $\Phi_t = (c_t, \Phi^{(1)}_t, \ldots, \Phi^{(p)}_t)$, and $\beta_t = \operatorname{vec}\Phi_t$.

Assuming for $\beta_t$ the same autoregressive form as in Eq. (18.5), the state space
representation (18.6) and (18.7) also applies to the TVC-VAR(p) model with $X_{t-1}$
replaced by $\mathbf{X}_{t-1}$. Note that the dimension of the state equation can become very
high because $\beta_t$ is an $n + n^2p$ vector.

Taking date 0 as the initial date, the prior distribution of the autoregressive
parameters is supposed to be normal:

$$\beta_0 = \operatorname{vec}\Phi_0 \sim N(\bar\beta, P_{0|0})$$

where $\bar\beta = \operatorname{vec}(0, I_n, 0, \ldots, 0)$. This implies that the mean for all coefficients,
including the constant term, is assumed to be zero except for the own lag coefficients
of order one $[\Phi^{(1)}_0]_{ii}$, $i = 1, \ldots, n$, which are assumed to be one. The covariance
matrix $P_{0|0}$ is taken as being diagonal so that there is no correlation across
coefficients. Thus, the prior specification amounts to assuming that each variable
follows a random walk with no interaction with other variables.

The strength of this belief is governed by a number of so-called hyperparameters
which regulate the diagonal elements of $P_{0|0}$. The first one, $\gamma^2$, controls the
confidence placed on the assumption that $[\Phi^{(1)}_0]_{ii} = 1$:

$$[\Phi^{(1)}_0]_{ii} \sim N(1, \gamma^2), \qquad i = 1, 2, \ldots, n.$$

A small (large) value of $\gamma^2$ thus means more (less) confidence. As the lag order
increases, more confidence is placed on the assumption $[\Phi^{(h)}_0]_{ii} = 0$:

$$[\Phi^{(h)}_0]_{ii} \sim N\left(0, \frac{\gamma^2}{h}\right), \qquad h = 2, \ldots, p \text{ and } i = 1, \ldots, n.$$

Instead of the harmonic decline other schemes have been proposed. For $h = 1, \ldots, p$
the off-diagonal elements of $\Phi^{(h)}_0$ are assumed to have prior distribution

$$[\Phi^{(h)}_0]_{ij} \sim N\left(0, \frac{w^2\gamma^2}{h}\,\frac{\hat\tau_i^2}{\hat\tau_j^2}\right), \qquad i, j = 1, \ldots, n,\ i \neq j,\ h = 1, 2, \ldots, p.$$

Thereby $\hat\tau_i^2/\hat\tau_j^2$ represents a correction factor which accounts for the magnitudes of
$X_{it}$ relative to $X_{jt}$. Specifically, $\hat\tau_i^2$ is the residual variance of a univariate AR(1)
model. The hyperparameter $w^2$ is assumed to be strictly smaller than one. This
represents the belief that $X_{j,t-h}$ is less likely to be important as an explanation for
$X_{i,t}$, $i \neq j$, than the own lag $X_{i,t-h}$. Finally, the strength of the belief that the constant
terms are zero is

$$c_{i0} \sim N(0, g\hat\tau_i^2).$$

This completes the specification for the prior belief on $\beta_0$. Combining all elements
we can write $P_{0|0}$ as a block diagonal matrix with diagonal blocks:

$$P_{0|0} = \begin{pmatrix} P^{(c)}_{0|0} & 0 \\ 0 & P^{(\Phi)}_{0|0} \end{pmatrix}$$

where $P^{(c)}_{0|0} = g \times \operatorname{diag}(\hat\tau_1^2, \ldots, \hat\tau_n^2)$ and $P^{(\Phi)}_{0|0} = \operatorname{diag}(\operatorname{vec}(G \otimes T))$. Thereby, $G$ and
$T$ are defined as

$$G = (\gamma^2, \gamma^2/2, \ldots, \gamma^2/p)$$

$$[T]_{ij} = \begin{cases} 1, & i = j; \\ w^2(\hat\tau_i^2/\hat\tau_j^2), & i \neq j. \end{cases}$$

According to Doan et al. (1984) the preferred values for the three hyperparameters
are $g = 700$, $\gamma^2 = 0.07$, and $w^2 = 0.01$.

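Thus, for a bivariate TVC-VAR(2) model the mean vector is given by
$\bar\beta = (0, 0, 1, 0, 0, 1, 0, 0, 0, 0)'$ with diagonal covariance matrix $P_{0|0}$:

$$P_{0|0} = \begin{pmatrix} P^{(c)}_{0|0} & 0 & 0 \\ 0 & P^{(1)}_{0|0} & 0 \\ 0 & 0 & P^{(2)}_{0|0} \end{pmatrix}$$

where the block for the constants is $P^{(c)}_{0|0} = g \times \operatorname{diag}(\hat\tau_1^2, \hat\tau_2^2)$, with

$$P^{(1)}_{0|0} = \gamma^2 \begin{pmatrix} 1 & 0 & 0 & 0 \\ 0 & w^2\hat\tau_2^2/\hat\tau_1^2 & 0 & 0 \\ 0 & 0 & w^2\hat\tau_1^2/\hat\tau_2^2 & 0 \\ 0 & 0 & 0 & 1 \end{pmatrix}$$

and

$$P^{(2)}_{0|0} = \frac{\gamma^2}{2} \begin{pmatrix} 1 & 0 & 0 & 0 \\ 0 & w^2\hat\tau_2^2/\hat\tau_1^2 & 0 & 0 \\ 0 & 0 & w^2\hat\tau_1^2/\hat\tau_2^2 & 0 \\ 0 & 0 & 0 & 1 \end{pmatrix}.$$

Next we specify the parameters of the state transition equation (18.5). Following
Doan et al. (1984), $F = \pi_F I_{n+pn^2}$ with $\pi_F = 0.999$ and $Q = \pi_Q P_{0|0}$ with $\pi_Q = 10^{-7}$.
The proportionality factor does, however, not apply to the constant terms. For
these terms, the corresponding diagonal elements of $Q$, $[Q]_{ii}$, $i = 1, \ldots, n$, are set
to $\pi_Q [P_{0|0}]_{i(n+1),i(n+1)}$, $i = 1, \ldots, n$. The reason for this correction is that the prior
put on the constants is rather loose as expressed by the high value of $g$. The final
component is a specification for $\Sigma$, the variance of $Z_t$. This matrix is believed to be
diagonal with $\Sigma = \pi_\Sigma \operatorname{diag}(\hat\tau_1^2, \ldots, \hat\tau_n^2)$ and $\pi_\Sigma = 0.9$.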
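The construction of the prior mean and of the diagonal of $P_{0|0}$ can be sketched as follows. The function name `minnesota_prior` and its argument `tau2` (the vector of residual variances from univariate AR(1) fits) are assumptions of this illustration; the default hyperparameter values are those quoted from Doan et al. (1984), and the ordering follows $\beta_0 = \operatorname{vec}(c_0, \Phi^{(1)}_0, \ldots, \Phi^{(p)}_0)$ with column-wise stacking.

```python
import numpy as np

def minnesota_prior(tau2, p, g=700.0, gamma2=0.07, w2=0.01):
    """Prior mean and covariance P_{0|0} of beta_0 for a TVC-VAR(p) in n variables."""
    tau2 = np.asarray(tau2, dtype=float)
    n = tau2.size
    # Mean: zero constants, identity for the first lag, zeros for higher lags
    mean = np.concatenate([np.zeros(n),
                           np.eye(n).flatten(order="F"),
                           np.zeros(n * n * (p - 1))])
    # [Tm]_{ij} = 1 if i == j and w2 * tau2_i / tau2_j otherwise
    Tm = w2 * np.outer(tau2, 1.0 / tau2)
    np.fill_diagonal(Tm, 1.0)
    G = gamma2 / np.arange(1, p + 1)              # harmonic decline gamma2 / h
    # diag(vec(G kron Tm)): vec(Tm) scaled by gamma2 / h, lag by lag
    var_lags = np.concatenate([g_h * Tm.flatten(order="F") for g_h in G])
    var = np.concatenate([g * tau2, var_lags])
    return mean, np.diag(var)

mean, P00 = minnesota_prior([1.0, 4.0], p=2)
```

For the bivariate example with two lags the function returns the mean vector $(0, 0, 1, 0, 0, 1, 0, 0, 0, 0)'$ and a $10 \times 10$ diagonal matrix whose blocks match the three blocks for the constants, the first lag, and the second lag.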

With these ingredients the state space model is completely specified. Given
observations $X_1, \ldots, X_t$, the Kalman filter produces a sequence of $\beta_{t+1|t}$,
$t = 1, 2, \ldots$ and one-period ahead forecasts $X_{t+1|t}$ computed as

$$X_{t+1|t} = (\mathbf{X}'_t \otimes I_n)\beta_{t+1|t}.$$

Doan et al. (1984) suggest computing an approximate $h$ period ahead forecast by
treating the forecasts from the previous periods as if they were actual observations.


18.3  Regime  Switching  Models

The regime switching model is similar to the time-varying model discussed in the
previous section. The difference is that the time-varying parameters are governed
by a hidden Markov chain with a finite state space $S = \{1, 2, \ldots, k\}$. Usually,
the number of states $k$ is small and is equal in practice to two or at most three.
The states usually have an economic connotation. For example, if $k$ equals two,
state 1 might correspond to a boom phase whereas state 2 to a recession. Such
models have a long tradition in economics and have therefore been used extensively.
Seminal references include Goldfeld and Quandt (1973, 1976), Hamilton (1994b),
Kim and Nelson (1999), Krolzig (1997), and Maddala (1986). Frühwirth-Schnatter
(2006) presents a detailed statistical analysis of regime switching models.

The starting point of our presentation of the regime switching model is again the
TVC-VAR(1) as given in Eq. (18.3). We associate to each state $j \in S$ a coefficient
matrix $\Phi^{(j)}$. Thus, in the regime switching model the coefficients $\Phi_t$ can only assume
a finite number of values $\Phi^{(1)}, \ldots, \Phi^{(k)}$ depending on the state of the Markov chain.
The actual value assigned to $\Phi_t$ is governed by a Markov chain defined through a
fixed but unknown transition probability matrix $P$ where

$$[P]_{ij} = P\left(\Phi_t = \Phi^{(j)} \mid \Phi_{t-1} = \Phi^{(i)}\right), \qquad i, j = 1, \ldots, k. \tag{18.13}$$

Thus, $[P]_{ij}$ is the probability that $\Phi_t$ assumes value $\Phi^{(j)}$ given that it assumed in the
previous period the value $\Phi^{(i)}$. The probability that $\Phi_{t+h}$ is in state $j$ given that $\Phi_t$
was in state $i$ is therefore $[P^h]_{ij}$. The definition of the transition matrix in Eq. (18.13)
implies that $P$ is a stochastic matrix, i.e. that $[P]_{ij} \geq 0$ and $\sum_{j=1}^{k}[P]_{ij} = 1$.
Moreover, we assume that the chain is regular meaning that it is ergodic (irreducible)
and aperiodic.10 This is equivalent to the existence of a fixed integer $m > 0$
such that $P^m$ has only strictly positive entries (see Berman and Plemmons 1994;
Chapter 8). Regular Markov chains have a unique ergodic (stationary) distribution
vector $\pi$ with strictly positive entries and determined by $\pi'P = \pi'$. This distribution
is approached from any initial distribution vector $\pi_0$, i.e. $\lim_{t\to\infty} \pi_0'P^t = \pi'$.
Moreover, $\lim_{t\to\infty} P^t = P^\infty$ where $P^\infty$ is a transition matrix with all rows equal
to $\pi'$.
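These properties are easy to verify for a concrete chain. The sketch below solves $\pi'P = \pi'$ together with the adding-up constraint by least squares and compares the result with a high power of $P$. The two-state transition matrix is a hypothetical example, not taken from the text.

```python
import numpy as np

def stationary_distribution(P):
    """Solve pi' P = pi' subject to sum(pi) = 1 for a row-stochastic matrix P."""
    k = P.shape[0]
    A = np.vstack([P.T - np.eye(k), np.ones((1, k))])   # stacks (P' - I) pi = 0 and 1'pi = 1
    b = np.concatenate([np.zeros(k), [1.0]])
    return np.linalg.lstsq(A, b, rcond=None)[0]

# Hypothetical two-regime chain: rows are "from" states, columns are "to" states
P = np.array([[0.9, 0.1],
              [0.3, 0.7]])
pi = stationary_distribution(P)
P_limit = np.linalg.matrix_power(P, 200)                # numerically close to P^infinity
```

Here every row of `P_limit` agrees with `pi`, so the chain forgets its initial distribution, as claimed for regular chains.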

Given this setup we can again invoke Theorem 18.1 and claim that a (strictly)
stationary solution of the form of Eq. (18.4) exists if all the autoregressive matrices
$\Phi^{(j)}$, $j = 1, 2, \ldots, k$, have eigenvalues strictly smaller than one in absolute value.

Given observations $x_T, x_{T-1}, \ldots, x_1, x_0$ for $X_T, X_{T-1}, \ldots, X_1, X_0$, a maximum
likelihood approach can be set up to estimate the unknown parameters
$\Phi^{(1)}, \ldots, \Phi^{(k)}, \Sigma, P$.11 Collect these parameters into a vector $\theta$ and denote by
$s_t \in S$ the state of the Markov chain in period $t$ and by $\mathcal{X}_t = (x_t, x_{t-1}, \ldots, x_1, x_0)$ the
information available up to period $t$. Write the conditional density of $x_t$ given $s_t = j$
and observations $\mathcal{X}_{t-1}$ as

$$f(x_t \mid s_t = j, \mathcal{X}_{t-1}; \theta).$$

The joint density of $(x_t, s_t = j)$ is

$$f(x_t, s_t = j \mid \mathcal{X}_{t-1}; \theta) = f(x_t \mid s_t = j, \mathcal{X}_{t-1}; \theta) \times P(s_t = j \mid \mathcal{X}_{t-1}; \theta)$$

where in analogy to the Kalman filter the expressions $P(s_t = j \mid \mathcal{X}_{t-1}; \theta)$,
$j = 1, \ldots, k$, are called the predicted transition probabilities. The conditional
marginal density of $x_t$ then becomes

$$f(x_t \mid \mathcal{X}_{t-1}; \theta) = \sum_{j=1}^{k} f(x_t \mid s_t = j, \mathcal{X}_{t-1}; \theta) \times P(s_t = j \mid \mathcal{X}_{t-1}; \theta).$$

In the case of $Z_t \sim \mathrm{IID}\ N(0, \Sigma)$ the above density is a finite mixture of Gaussian
distributions (see Frühwirth-Schnatter 2006; for details). The (conditional) log
likelihood function is therefore given by

$$\ell(\theta) = \sum_{t=1}^{T} \log f(x_t \mid \mathcal{X}_{t-1}; \theta).$$



10 A chain is called ergodic or irreducible if for every pair of states $i$ and $j$ there is a strictly positive
probability that the chain moves from state $i$ to state $j$ in finitely many steps. A chain is called
aperiodic if it can return to any state $i$ at irregular times. See, among others, Norris (1998) and
Berman and Plemmons (1994) for an introduction to Markov chains and their terminology.

11 The presentation of the maximum likelihood approach follows closely the exposition by
Hamilton (1994b; chapter 22) where more details can be found.

In order to evaluate the likelihood function note that the joint density of $(x_t, s_t = j)$
may also be factored as

$$f(x_t, s_t = j \mid \mathcal{X}_{t-1}; \theta) = P(s_t = j \mid \mathcal{X}_t; \theta) \times f(x_t \mid \mathcal{X}_{t-1}; \theta).$$

Combining these expressions one obtains an expression for the filtered transition
probabilities $P(s_t = j \mid \mathcal{X}_t; \theta)$:

$$P(s_t = j \mid \mathcal{X}_t; \theta)
= \frac{f(x_t \mid s_t = j, \mathcal{X}_{t-1}; \theta) \times P(s_t = j \mid \mathcal{X}_{t-1}; \theta)}{f(x_t \mid \mathcal{X}_{t-1}; \theta)}
= \frac{f(x_t \mid s_t = j, \mathcal{X}_{t-1}; \theta) \times P(s_t = j \mid \mathcal{X}_{t-1}; \theta)}{\sum_{i=1}^{k} f(x_t \mid s_t = i, \mathcal{X}_{t-1}; \theta) \times P(s_t = i \mid \mathcal{X}_{t-1}; \theta)} \tag{18.14}$$



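Next period’s predicted transition probabilities are then obtained by multiplication
with the transition matrix:

$$\begin{pmatrix} P(s_{t+1} = 1 \mid \mathcal{X}_t; \theta) \\ \vdots \\ P(s_{t+1} = k \mid \mathcal{X}_t; \theta) \end{pmatrix}
= P' \times \begin{pmatrix} P(s_t = 1 \mid \mathcal{X}_t; \theta) \\ \vdots \\ P(s_t = k \mid \mathcal{X}_t; \theta) \end{pmatrix} \tag{18.15}$$
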


Given initial probabilities $P(s_1 = j \mid \mathcal{X}_0; \theta)$, $j = 1, \ldots, k$, and a fixed value for
$\theta$, Eqs. (18.14) and (18.15) can be iterated forward to produce a sequence of
predicted transition probabilities $(P(s_t = 1 \mid \mathcal{X}_{t-1}; \theta), \ldots, P(s_t = k \mid \mathcal{X}_{t-1}; \theta))'$,
$t = 1, 2, \ldots, T$, which can be used to evaluate the Gaussian likelihood function.
Numerical procedures must then be used for the maximization of the likelihood
function. This task is not without challenge because the likelihood function of
Gaussian mixture models typically has singularities and many local maxima.
Kiefer (1978) showed that there exists a bounded local maximum which yields
a consistent and asymptotically normal estimate for $\theta$ for which standard errors
can be constructed in the usual way. In practice, problems encountered during the
maximization can be alleviated by experimentation with alternative starting values.
Thereby the initial probabilities $(P(s_1 = 1 \mid \mathcal{X}_0; \theta), \ldots, P(s_1 = k \mid \mathcal{X}_0; \theta))'$ could either
be treated as additional parameters as in Goldfeld and Quandt (1973) or set to the
uniform distribution. For technical details and alternative estimation strategies, like
the EM algorithm, see Hamilton (1994b; chapter 22) and in particular
Frühwirth-Schnatter (2006).
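The forward recursion formed by Eqs. (18.14) and (18.15) can be sketched for a regime switching VAR(1) without intercept and with a common Gaussian covariance matrix. The function name and argument layout are assumptions of this illustration; densities are evaluated directly rather than in logs, which is adequate for short samples but may underflow for long ones.

```python
import numpy as np

def hamilton_filter(x, Phis, Sigma, P, prob0):
    """Filtered and predicted regime probabilities plus the log-likelihood.

    x     : (T+1, n) observations x_0, ..., x_T.
    Phis  : list of k coefficient matrices, one per regime.
    P     : row-stochastic transition matrix as in Eq. (18.13).
    prob0 : initial probabilities P(s_1 = j | X_0), j = 1, ..., k.
    """
    T, n = x.shape[0] - 1, x.shape[1]
    k = len(Phis)
    Sinv = np.linalg.inv(Sigma)
    logdet = np.linalg.slogdet(Sigma)[1]
    pred = np.asarray(prob0, dtype=float)
    filtered, predicted, loglik = np.zeros((T, k)), np.zeros((T, k)), 0.0
    for t in range(1, T + 1):
        predicted[t - 1] = pred
        dens = np.empty(k)
        for j in range(k):
            e = x[t] - Phis[j] @ x[t - 1]              # residual under regime j
            dens[j] = np.exp(-0.5 * (n * np.log(2 * np.pi) + logdet + e @ Sinv @ e))
        joint = dens * pred                            # f(x_t, s_t = j | X_{t-1})
        marginal = joint.sum()                         # f(x_t | X_{t-1})
        filtered[t - 1] = joint / marginal             # Eq. (18.14)
        loglik += np.log(marginal)
        pred = P.T @ filtered[t - 1]                   # Eq. (18.15)
    return filtered, predicted, loglik
```

The returned log-likelihood can be passed, with a sign change, to a numerical optimizer over the parameter vector; trying several starting values is advisable for the reasons given above.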

By reversing the above recursion it is possible to compute smoothed transition
probabilities $P(s_t = j \mid \mathcal{X}_T; \theta)$ (see Kim 1994):

$$P(s_t = j \mid \mathcal{X}_T; \theta) = P(s_t = j \mid \mathcal{X}_t; \theta) \sum_{i=1}^{k} [P]_{ji}\, \frac{P(s_{t+1} = i \mid \mathcal{X}_T; \theta)}{P(s_{t+1} = i \mid \mathcal{X}_t; \theta)}.$$

The iteration is initialized with $P(s_T = j \mid \mathcal{X}_T; \theta)$ which has been computed in the
forward recursion.

The basic model can be, and has been, generalized in several dimensions. The most
obvious one is the inclusion of additional lags beyond the first one. The second
one concerns the possibility of a regime switching covariance matrix $\Sigma$. These
modifications can be accommodated using the methods outlined above. Thirdly,
one may envision time-varying transition probabilities to account for duration
dependence. In business cycle analysis, for example, the probability of moving out
of a recession may depend on how long the economy has been in the recession
regime. This idea can be implemented by modeling the transition probabilities via a
logit specification in which the transition probabilities of state $i$ take the form

$$\frac{\exp(z_t'\alpha_i)}{1 + \exp(z_t'\alpha_i)}$$

where $z_t$ includes a constant and a set of additional variables. These additional
variables can be some exogenous variables, but more interestingly may include
some lagged variables $x_{t-d}$ (Krolzig 1997). Note that the transition probabilities do
not only depend on $z_t$, but also on the state. The resulting model has some features
shared with the smooth transition autoregressive model of Granger and Teräsvirta
(1993). Early economic applications of regime switching models with time-varying
transition probabilities can be found in Diebold et al. (1994), Filardo (1994), and
Filardo and Gordon (1998).
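Returning to the case of a constant transition matrix, the backward recursion of Kim (1994) for the smoothed transition probabilities can be sketched as follows. The inputs are the filtered and predicted probabilities produced by the forward recursion, stored row by row for $t = 1, \ldots, T$; the function name and storage layout are assumptions of this illustration. With $P$ row-stochastic as in Eq. (18.13), the weight attached to state $i$ in period $t+1$ is the probability of moving from $j$ to $i$.

```python
import numpy as np

def kim_smoother(filtered, predicted, P):
    """Smoothed regime probabilities P(s_t = j | X_T) from a forward filter pass.

    filtered[t]  holds P(s_t = . | X_t), predicted[t] holds P(s_t = . | X_{t-1}).
    P is row-stochastic: entry (j, i) is the probability of moving from j to i.
    """
    T = filtered.shape[0]
    smoothed = np.zeros_like(filtered)
    smoothed[-1] = filtered[-1]                        # initialization at t = T
    for t in range(T - 2, -1, -1):
        ratio = smoothed[t + 1] / predicted[t + 1]     # P(s_{t+1}=i|X_T) / P(s_{t+1}=i|X_t)
        smoothed[t] = filtered[t] * (P @ ratio)        # sum over i of [P]_{ji} * ratio_i
    return smoothed
```

Provided the predicted probabilities are the ones implied by the filter, each row of the output sums to one, which is a useful check on the indexing of the transition matrix.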

An important aspect in practice is the determination of the number of regimes. Unfortunately, there is no direct test available for the null hypothesis $k = m$ against the alternative $k = m + 1$. The reason is that the likelihood contains parameters which are only present under the alternative: the parameters describing the $(m+1)$-th state are unidentified under the null hypothesis. The problem has been analyzed by Andrews and Ploberger (1994) in a general theoretical context. Alternatively, one may estimate the model under the null hypothesis and conduct a series of specification tests as proposed by Hamilton (1996). It has also been suggested to use information criteria like AIC and BIC to determine the number of regimes (Frühwirth-Schnatter 2006, pp. 346-347):


$$\mathrm{AIC} = -2\ell(\hat\theta) + 2k(k-1)$$

$$\mathrm{BIC} = -2\ell(\hat\theta) + \log(T)\,k(k-1)$$


where $k(k-1)$ is the number of free parameters in the transition matrix $P$.
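The two criteria can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `regime_information_criteria` is hypothetical, the log-likelihood values are made-up numbers, and the penalty counts only the $k(k-1)$ transition parameters, exactly as in the formulas above.

```python
import math

def regime_information_criteria(loglik, k, T):
    """Return (AIC, BIC) with the penalty k(k-1) used in the text."""
    n_free = k * (k - 1)  # free parameters of the k x k transition matrix
    aic = -2.0 * loglik + 2.0 * n_free
    bic = -2.0 * loglik + math.log(T) * n_free
    return aic, bic

# Hypothetical maximized log-likelihoods for k = 2 and k = 3 regimes
# (illustrative values only, not taken from any estimation in the book).
candidates = {2: -512.3, 3: -508.9}
T = 200  # assumed sample size

scores = {k: regime_information_criteria(ll, k, T) for k, ll in candidates.items()}
best_by_aic = min(scores, key=lambda k: scores[k][0])
best_by_bic = min(scores, key=lambda k: scores[k][1])
print(scores, best_by_aic, best_by_bic)
```

The number of regimes with the smallest criterion value would be selected; because BIC penalizes additional parameters more heavily for moderate and large $T$, it tends to favor fewer regimes than AIC.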


A  Complex Numbers


The simple quadratic equation $x^2 + 1 = 0$ has no solution in the field of real numbers, $\mathbb{R}$. Thus, it is necessary to envisage the larger field of complex numbers $\mathbb{C}$. A complex number $z$ is an ordered pair $(a, b)$ of real numbers where ordered means that we regard $(a, b)$ and $(b, a)$ as distinct if $a \neq b$. Let $x = (a, b)$ and $y = (c, d)$ be two complex numbers. Then we endow the set of complex numbers with an addition and a multiplication in the following way:

addition: $x + y = (a, b) + (c, d) = (a + c, b + d)$

multiplication: $xy = (a, b)(c, d) = (ac - bd, ad + bc)$.

These two operations turn $\mathbb{C}$ into a field where $(0, 0)$ and $(1, 0)$ play the role of 0 and 1.¹ The real numbers $\mathbb{R}$ are embedded into $\mathbb{C}$ because we identify any $a \in \mathbb{R}$ with $(a, 0) \in \mathbb{C}$.

The number $i = (0, 1)$ is of special interest. It solves the equation $x^2 + 1 = 0$, i.e. $i^2 = -1$. The other solution is $-i = (0, -1)$. Thus any complex number $(a, b)$ may be written as $(a, b) = a + ib$ where $a, b$ are arbitrary real numbers.²


¹ Subtraction and division can be defined accordingly:

subtraction: $(a, b) - (c, d) = (a - c, b - d)$

division: $\dfrac{(a, b)}{(c, d)} = \left(\dfrac{ac + bd}{c^2 + d^2}, \dfrac{bc - ad}{c^2 + d^2}\right), \quad c^2 + d^2 \neq 0.$

² A more detailed introduction to complex numbers can be found in Rudin (1976) or any other mathematics textbook.
An element $z$ in this field can be represented in two ways:

$$z = a + ib \qquad \text{Cartesian coordinates}$$

$$\phantom{z} = re^{i\theta} = r(\cos\theta + i\sin\theta) \qquad \text{polar coordinates.}$$

In the representation in Cartesian coordinates $a = \mathrm{Re}(z) = \Re(z)$ is called the real part whereas $b = \mathrm{Im}(z) = \Im(z)$ is called the imaginary part of $z$.

A complex number $z$ can be viewed as a point in the two-dimensional Cartesian coordinate system with coordinates $(a, b)$. This geometric interpretation is represented in Fig. A.1.

The absolute value or modulus of $z$, denoted by $|z|$, is given by $r = \sqrt{a^2 + b^2}$. Thus, the absolute value is nothing but the distance of $z$, viewed as a point in the complex plane (the two-dimensional Cartesian coordinate system), to the origin (see Fig. A.1). $\theta$ denotes the angle to the positive real axis (x-axis) measured in radians. It is denoted by $\theta = \arg z$. It holds that $\tan\theta = b/a$. Finally, the conjugate of $z$, denoted by $\bar{z}$, is defined by $\bar{z} = a - ib$.

Setting $r = 1$ and $\theta = \pi$ gives the following famous formula:

$$e^{i\pi} + 1 = (\cos\pi + i\sin\pi) + 1 = -1 + 1 = 0.$$

This formula relates the most famous numbers in mathematics.


Fig. A.1  Representation of a complex number (horizontal axis: real part)
From the definition of complex numbers in polar coordinates, we immediately derive the following implications:

$$\cos\theta = \frac{e^{i\theta} + e^{-i\theta}}{2} = \frac{a}{r}, \qquad \sin\theta = \frac{e^{i\theta} - e^{-i\theta}}{2i} = \frac{b}{r}.$$

Further implications are de Moivre's formula and Pythagoras' theorem (see Fig. A.1):

de Moivre's formula: $\left(re^{i\theta}\right)^n = r^n e^{in\theta} = r^n(\cos n\theta + i\sin n\theta)$

Pythagoras' theorem: $1 = e^{i\theta}e^{-i\theta} = (\cos\theta + i\sin\theta)(\cos\theta - i\sin\theta) = \cos^2\theta + \sin^2\theta$

From Pythagoras' theorem it follows that $r^2 = a^2 + b^2$. The representation in polar coordinates allows one to derive many trigonometric formulas.
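The polar representation, de Moivre's formula, and the identity $e^{i\pi} + 1 = 0$ can be checked numerically. The following is a minimal sketch using Python's standard `cmath` module; the particular number $z = 1 + i$ and the exponent $n = 5$ are arbitrary choices made for illustration.

```python
import cmath
import math

z = complex(1.0, 1.0)        # a = 1, b = 1 (illustrative choice)
r, theta = cmath.polar(z)    # modulus r = |z| and argument theta = arg z

# de Moivre: (r e^{i theta})^n = r^n (cos(n theta) + i sin(n theta))
n = 5
lhs = z ** n
rhs = r ** n * complex(math.cos(n * theta), math.sin(n * theta))

# Euler's identity: e^{i pi} + 1 = 0 (up to floating-point error)
euler = cmath.exp(1j * math.pi) + 1

print(r, theta, lhs, rhs, abs(euler))
```

Both sides of de Moivre's formula agree up to rounding error, and `abs(euler)` is of the order of machine precision rather than exactly zero.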

Consider the polynomial $\Phi(z) = \phi_0 - \phi_1 z - \phi_2 z^2 - \ldots - \phi_p z^p$ of order $p \geq 1$ with $\phi_0 = 1$.³ The fundamental theorem of algebra then states that every polynomial of order $p \geq 1$ has exactly $p$ roots in the field of complex numbers. Thus, the field of complex numbers is algebraically closed. Denote these roots by $\lambda_1, \ldots, \lambda_p$, allowing that some roots may appear several times. The polynomial can then be factorized as

$$\Phi(z) = \left(1 - \lambda_1^{-1}z\right)\left(1 - \lambda_2^{-1}z\right)\cdots\left(1 - \lambda_p^{-1}z\right).$$

This expression is well-defined because the assumption of a nonzero constant ($\phi_0 = 1 \neq 0$) excludes the possibility of roots equal to zero. If we assume that the coefficients $\phi_j$, $j = 0, \ldots, p$, are real numbers, the complex roots appear in conjugate pairs. Thus if $z = a + ib$, $b \neq 0$, is a root then $\bar{z} = a - ib$ is also a root.
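As a small numerical illustration, the factorization and the conjugate-pair property can be verified for a polynomial of order two. The coefficients $\phi_1 = 1$ and $\phi_2 = -0.5$ are assumptions chosen so that the roots are complex; the roots are obtained from the ordinary quadratic formula.

```python
import cmath

phi1, phi2 = 1.0, -0.5   # illustrative real coefficients, phi_0 = 1

def Phi(z):
    """Phi(z) = 1 - phi1*z - phi2*z^2."""
    return 1 - phi1 * z - phi2 * z ** 2

# Roots of -phi2*z^2 - phi1*z + 1 = 0 by the quadratic formula
disc = cmath.sqrt(phi1 ** 2 + 4 * phi2)
lam1 = (phi1 + disc) / (-2 * phi2)
lam2 = (phi1 - disc) / (-2 * phi2)

def Phi_factored(z):
    """(1 - z/lam1)(1 - z/lam2), the factorized form from the text."""
    return (1 - z / lam1) * (1 - z / lam2)

z0 = 0.3 + 0.2j   # arbitrary evaluation point
print(lam1, lam2, Phi(z0), Phi_factored(z0))
```

For these coefficients the roots are $1 + i$ and $1 - i$, a conjugate pair, and both forms of the polynomial give the same value at any point.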


³ The notation with "$-\phi_j z^j$" instead of "$\phi_j z^j$" was chosen to conform to the notation of AR-models.
Linear difference equations play an important role in time series analysis. We therefore summarize the most important results.¹ Consider the following linear difference equation of order $p$ with constant coefficients. This equation is defined by the recursion:

$$X_t = \phi_1 X_{t-1} + \ldots + \phi_p X_{t-p}, \qquad \phi_p \neq 0, \quad t \in \mathbb{Z}.$$

Thereby $\{X_t\}$ represents a sequence of real numbers and $\phi_1, \ldots, \phi_p$ are $p$ constant coefficients. The above difference equation is called homogeneous because it involves no other variable than $X_t$. A solution to this equation is a function $F : \mathbb{Z} \to \mathbb{R}$ such that its values $F(t)$ or $F_t$ reduce the difference equation to an identity.

It is easy to see that if $\{X_t^{(1)}\}$ and $\{X_t^{(2)}\}$ are two solutions then $\{c_1 X_t^{(1)} + c_2 X_t^{(2)}\}$, for any $c_1, c_2 \in \mathbb{R}$, is also a solution. The set of solutions is therefore a linear space (vector space).

Definition B.1. A set of solutions $\{\{X_t^{(1)}\}, \ldots, \{X_t^{(m)}\}\}$, $m \leq p$, is called linearly independent if

$$c_1 X_t^{(1)} + \ldots + c_m X_t^{(m)} = 0, \qquad \text{for } t = 0, 1, \ldots, p - 1$$

implies that $c_1 = \ldots = c_m = 0$. Otherwise we call the set linearly dependent.

Given arbitrary starting values $x_0, \ldots, x_{p-1}$ for $X_0, \ldots, X_{p-1}$, the difference equation determines all further values through the recursion:

$$X_t = \phi_1 X_{t-1} + \ldots + \phi_p X_{t-p}, \qquad t = p, p + 1, \ldots.$$


¹ For more detailed presentations see Agarwal (2000), Elaydi (2005) or Neusser (2009).
Similarly for $X_t$ with $t = -1, -2, \ldots$. Suppose we have $p$ linearly independent solutions $\{\{X_t^{(1)}\}, \ldots, \{X_t^{(p)}\}\}$; then there exist exactly $p$ numbers $c_1, \ldots, c_p$ such that the solution

$$X_t = c_1 X_t^{(1)} + c_2 X_t^{(2)} + \ldots + c_p X_t^{(p)}$$

is compatible with arbitrary starting values $x_0, \ldots, x_{p-1}$. These starting values then determine uniquely all values of the sequence $\{X_t\}$. Thus $\{X_t\}$ is the only solution compatible with the starting values. The goal therefore consists in finding $p$ linearly independent solutions.

We guess that the solutions are of the form $X_t = z^{-t}$ where $z$ may be a complex number. If this guess is right then we must have for $t = 0$:

$$1 - \phi_1 z - \ldots - \phi_p z^p = 0.$$

This equation is called the characteristic equation.² Thus $z$ must be a root of the polynomial $\Phi(z) = 1 - \phi_1 z - \ldots - \phi_p z^p$. From the fundamental theorem of algebra we know that there are exactly $p$ roots in the field of complex numbers. Denote these roots by $z_1, \ldots, z_p$.

Suppose that these roots are different from each other. In this case $\{\{z_1^{-t}\}, \ldots, \{z_p^{-t}\}\}$ constitutes a set of $p$ linearly independent solutions. To show this it is sufficient to verify that the determinant of the matrix

$$W = \begin{pmatrix} 1 & 1 & \cdots & 1 \\ z_1^{-1} & z_2^{-1} & \cdots & z_p^{-1} \\ z_1^{-2} & z_2^{-2} & \cdots & z_p^{-2} \\ \vdots & \vdots & \ddots & \vdots \\ z_1^{-p+1} & z_2^{-p+1} & \cdots & z_p^{-p+1} \end{pmatrix}$$

is different from zero. This determinant is known as Vandermonde's determinant and is equal to $\det W = \prod_{1 \leq i < j \leq p}\left(z_j^{-1} - z_i^{-1}\right)$. This determinant is clearly different from zero because the roots are different from each other. The general solution to the difference equation therefore is


$$X_t = c_1 z_1^{-t} + \ldots + c_p z_p^{-t} \qquad \text{(B.1)}$$

where the constants $c_1, \ldots, c_p$ are determined from the starting values (initial conditions).
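The construction of the general solution can be illustrated for $p = 2$. The sketch below is written under stated assumptions: the coefficients $\phi_1 = 1$, $\phi_2 = -0.5$ and the starting values $x_0 = 1$, $x_1 = 2$ are arbitrary illustrative choices, and the roots are assumed to be distinct. The constants $c_1, c_2$ solve the two linear equations implied by the starting values, and the closed form is compared with the recursion.

```python
import cmath

phi1, phi2 = 1.0, -0.5   # illustrative coefficients, phi2 != 0
x0, x1 = 1.0, 2.0        # arbitrary starting values

# Roots z1, z2 of the characteristic equation 1 - phi1*z - phi2*z^2 = 0
disc = cmath.sqrt(phi1 ** 2 + 4 * phi2)
z1 = (phi1 + disc) / (-2 * phi2)
z2 = (phi1 - disc) / (-2 * phi2)
u1, u2 = 1 / z1, 1 / z2  # the solutions are u_j^t = z_j^{-t}

# Starting values: c1 + c2 = x0 and c1*u1 + c2*u2 = x1
c1 = (x1 - x0 * u2) / (u1 - u2)
c2 = x0 - c1

def closed_form(t):
    """X_t = c1*z1^{-t} + c2*z2^{-t}; imaginary parts cancel for real data."""
    return (c1 * u1 ** t + c2 * u2 ** t).real

# Direct recursion X_t = phi1*X_{t-1} + phi2*X_{t-2}
path = [x0, x1]
for t in range(2, 12):
    path.append(phi1 * path[t - 1] + phi2 * path[t - 2])

max_gap = max(abs(closed_form(t) - path[t]) for t in range(12))
print(max_gap)
```

The closed form and the recursion agree up to rounding error. With complex roots, as here, the constants form a conjugate pair, so the solution is real even though its building blocks are not.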

In the case where some roots of the characteristic polynomial are equal, the general solution becomes more involved. Let $z_1, \ldots, z_r$, $r < p$, be the roots which are different from each other and denote their corresponding multiplicities by $m_1, \ldots, m_r$. It holds that $\sum_{j=1}^{r} m_j = p$. The general solution is then given by

$$X_t = \sum_{j=1}^{r}\left(c_{j0} + c_{j1}t + \ldots + c_{j,m_j-1}t^{m_j-1}\right)z_j^{-t} \qquad \text{(B.2)}$$

where the constants $c_{ji}$ are again determined from the starting values (initial conditions).

² Sometimes one can find $z^p - \phi_1 z^{p-1} - \ldots - \phi_p = 0$ as the characteristic equation. The roots of the two characteristic equations are then reciprocal to each other.
This appendix presents the relevant concepts and theorems from probability theory. The reader interested in more details should consult corresponding textbooks, for example Billingsley (1986), Brockwell and Davis (1991), Hogg and Craig (1995), or Kallenberg (2002) among many others.

In the following, all real random variables or random vectors $X$ are defined with respect to some probability space $(\Omega, \mathfrak{A}, P)$. Thereby, $\Omega$ denotes an arbitrary space with $\sigma$-field $\mathfrak{A}$ and probability measure $P$. A random variable, respectively random vector, $X$ is then defined as a measurable function from $\Omega$ to $\mathbb{R}$, respectively $\mathbb{R}^n$. The probability space $\Omega$ plays no role as it is introduced just for the sake of mathematical rigor. The interest rather focuses on the distributions induced by $P \circ X^{-1}$.

We  will  make  use  of  the  following  important  inequalities. 

Theorem C.1 (Cauchy-Bunyakovskii-Schwarz Inequality). For any two random variables $X$ and $Y$,

$$|E(XY)| \leq \sqrt{EX^2}\,\sqrt{EY^2}.$$

The equality holds if and only if $X = \dfrac{E(XY)}{E(Y^2)}\,Y$.

Theorem C.2 (Minkowski's Inequality). Let $X$ and $Y$ be two random variables with $E|X|^2 < \infty$ and $E|Y|^2 < \infty$, then

$$\left(E|X + Y|^2\right)^{1/2} \leq \left(E|X|^2\right)^{1/2} + \left(E|Y|^2\right)^{1/2}.$$

Theorem C.3 (Chebyschev's Inequality). If $E|X|^r < \infty$ for $r \geq 0$ then for every $r \geq 0$ and any $\varepsilon > 0$

$$P[|X| \geq \varepsilon] \leq \varepsilon^{-r}E|X|^r.$$
Theorem C.4 (Borel-Cantelli Lemma). Let $A_1, A_2, \ldots \in \mathfrak{A}$ be an infinite sequence of events in some probability space $(\Omega, \mathfrak{A}, P)$ such that $\sum_{k=1}^{\infty} P(A_k) < \infty$. Then, $P\{A_k \text{ i.o.}\} = 0$. The event $\{A_k \text{ i.o.}\}$ is defined by $\{A_k \text{ i.o.}\} = \limsup_k \{A_k\} = \bigcap_{k=1}^{\infty}\bigcup_{j=k}^{\infty} A_j$ where i.o. stands for infinitely often.

On several occasions it is necessary to evaluate the limit of a sequence of random variables. In probability theory several concepts of convergence are discussed: almost sure convergence, convergence in probability, convergence in r-th mean (in particular convergence in quadratic mean), and convergence in distribution. We only give definitions and the most important theorems, leaving an in-depth discussion to the relevant literature. Although not explicitly mentioned, many of the theorems below also hold in an analogous way in a multidimensional context.

Definition C.1 (Almost Sure Convergence). For random variables $X$ and $\{X_t\}$ defined on the same probability space $(\Omega, \mathfrak{A}, P)$, we say that $\{X_t\}$ converges almost surely or with probability one to $X$ if

$$P\left\{\omega \in \Omega : \lim_{t\to\infty} X_t(\omega) = X(\omega)\right\} = 1.$$

This fact is denoted by $X_t \xrightarrow{a.s.} X$ or $\lim X_t = X$ a.s.

Theorem C.5 (Kolmogorov's Strong Law of Large Numbers (SLLN)). Let $X, X_1, X_2, \ldots$ be identically and independently distributed random variables. Then, the arithmetic average $\bar{X}_T = \frac{1}{T}\sum_{t=1}^{T} X_t$ converges almost surely to $EX$ if and only if $E|X| < \infty$.

Definition C.2 (Convergence in Probability). For random variables $X$ and $\{X_t\}$ defined on the same probability space, we say that $\{X_t\}$ converges in probability to $X$ if

$$\lim_{t\to\infty} P[|X_t - X| > \varepsilon] = 0 \qquad \text{for all } \varepsilon > 0.$$

This fact is denoted by $X_t \xrightarrow{p} X$ or $\operatorname{plim} X_t = X$.

Remark C.1. If $X$ and $\{X_t\}$ are real valued random vectors, we replace the absolute value in the definition above by the Euclidean norm $\|\cdot\|$. This is, however, equivalent to saying that every component $X_{it}$ converges in probability to $X_i$, the $i$-th component of $X$.

Definition C.3 (Convergence in r-th Mean). A sequence $\{X_t\}$ of random variables converges in r-th mean to a random variable $X$ if

$$\lim_{t\to\infty} E\left(|X_t - X|^r\right) = 0 \qquad \text{for } r > 0.$$

We denote this fact by $X_t \xrightarrow{r} X$. If $r = 1$ we say that the sequence converges absolutely; and if $r = 2$ we say that the sequence converges in mean square, which is denoted by $X_t \xrightarrow{m.s.} X$.

Remark C.2. In the case $r = 2$, the corresponding definition for random vectors is

$$\lim_{t\to\infty} E\left(\|X_t - X\|^2\right) = \lim_{t\to\infty} E(X_t - X)'(X_t - X) = 0.$$

Theorem C.6 (Riesz-Fischer). Let $\{X_t\}$ be a sequence of random variables such that $\sup_t E|X_t|^2 < \infty$. Then there exists a random variable $X$ with $E|X|^2 < \infty$ such that $X_t \xrightarrow{m.s.} X$ if and only if $E|X_t - X_s|^2 \to 0$ for $t, s \to \infty$.

This version of the Riesz-Fischer theorem provides a condition, known as the Cauchy criterion, which is often easier to verify when the limit is unknown.

Definition C.4 (Convergence in Distribution). A sequence $\{X_t\}$ of random vectors with corresponding distribution functions $\{F_{X_t}\}$ converges in distribution, if there exists a random vector $X$ with distribution function $F_X$ such that

$$\lim_{t\to\infty} F_{X_t}(x) = F_X(x) \qquad \text{for all } x \in C$$

where $C$ denotes the set of points for which $F_X(x)$ is continuous. We denote this fact by $X_t \xrightarrow{d} X$.

Note that, in contrast to the previously mentioned modes of convergence, convergence in distribution does not require that all random vectors are defined on the same probability space. Convergence in distribution states that, for large enough $t$, the distribution of $X_t$ can be approximated by the distribution of $X$.

The  following  Theorem  relates  the  four  convergence  concepts. 

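Theorem C.7. (i) If $X_t \xrightarrow{a.s.} X$ then $X_t \xrightarrow{p} X$.

(ii) If $X_t \xrightarrow{p} X$ then there exists a subsequence $\{X_{t_n}\}$ such that $X_{t_n} \xrightarrow{a.s.} X$.

(iii) If $X_t \xrightarrow{r} X$ then $X_t \xrightarrow{p} X$ by Chebyschev's inequality (Theorem C.3).

(iv) If $X_t \xrightarrow{p} X$ then $X_t \xrightarrow{d} X$.

(v) If $X$ is a fixed constant, then $X_t \xrightarrow{d} X$ implies $X_t \xrightarrow{p} X$. Thus, the two concepts are equivalent under this assumption.

These facts can be summarized graphically:

$$X_t \xrightarrow{a.s.} X \;\Longrightarrow\; X_t \xrightarrow{p} X \;\Longrightarrow\; X_t \xrightarrow{d} X, \qquad X_t \xrightarrow{r} X \;\Longrightarrow\; X_t \xrightarrow{p} X.$$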


A further useful theorem is:

Theorem C.8. If $EX_t \to \mu$ and $VX_t \to 0$ then $X_t \xrightarrow{m.s.} \mu$ and consequently $X_t \xrightarrow{p} \mu$.

Theorem C.9 (Continuous Mapping Theorem). For any continuous function $f : \mathbb{R}^n \to \mathbb{R}^m$ and random vectors $\{X_t\}$ and $X$ defined on some probability space, the following implications hold:

(i) $X_t \xrightarrow{a.s.} X$ implies $f(X_t) \xrightarrow{a.s.} f(X)$.

(ii) $X_t \xrightarrow{p} X$ implies $f(X_t) \xrightarrow{p} f(X)$.

(iii) $X_t \xrightarrow{d} X$ implies $f(X_t) \xrightarrow{d} f(X)$.

An  important  application  of  the  Continuous  Mapping  Theorem  is  the  so-called 
Delta  method  which  can  be  used  to  approximate  the  distribution  of  f(Xt)  (see 
Appendix  E). 

A  further  useful  result  is  given  by: 

Theorem C.10 (Slutzky's Lemma). Let $\{X_t\}$ and $\{Y_t\}$ be two sequences of random vectors such that $X_t \xrightarrow{d} X$ and $Y_t \xrightarrow{d} c$, $c$ constant, then

(i) $X_t + Y_t \xrightarrow{d} X + c$,

(ii) $Y_t'X_t \xrightarrow{d} c'X$,

(iii) $X_t / Y_t \xrightarrow{d} X / c$ if $c$ is a nonzero scalar.

Like  the  (cumulative)  distribution  function,  the  characteristic  function  provides 
an  alternative  way  to  describe  a  random  variable. 

Definition C.5 (Characteristic Function). The characteristic function of a real random vector $X$, denoted by $\varphi_X$, is defined by

$$\varphi_X(s) = E e^{i s'X}, \qquad s \in \mathbb{R}^n,$$

where $i$ is the imaginary unit.

If, for example, $X \sim N(\mu, \sigma^2)$, then $\varphi_X(s) = \exp\left(is\mu - \tfrac{1}{2}\sigma^2 s^2\right)$. The characteristic function uniquely determines the distribution of $X$. Thus, if two random variables have the same characteristic function, they have the same distribution. Moreover, convergence in distribution is equivalent to convergence of the corresponding characteristic functions.

Theorem C.11 (Convergence of Characteristic Functions, Lévy). Let $\{X_t\}$ be a sequence of real random variables with corresponding characteristic functions $\varphi_{X_t}$, then

$$X_t \xrightarrow{d} X \quad \text{if and only if} \quad \lim_{t\to\infty} \varphi_{X_t}(\lambda) = \varphi_X(\lambda) \quad \text{for all } \lambda \in \mathbb{R}^n.$$

In many cases the limiting distribution is a normal distribution. In this case one speaks of asymptotic normality.

Definition C.6 (Asymptotic Normality). A sequence of random variables $\{X_t\}$ with "means" $\mu_t$ and "variances" $\sigma_t^2 > 0$ is said to be asymptotically normally distributed if

$$\sigma_t^{-1}(X_t - \mu_t) \xrightarrow{d} X \sim N(0, 1).$$

Note that the definition does not require that $\mu_t = EX_t$ nor that $\sigma_t^2 = V(X_t)$. Asymptotic normality is obtained if the $X_t$'s are identically and independently distributed with constant mean and variance. In this case the Central Limit Theorem (CLT) holds.

Theorem C.12 (Central Limit Theorem). Let $\{X_t\}$ be a sequence of identically and independently distributed random variables with constant mean $\mu$ and constant variance $\sigma^2$ then

$$\sqrt{T}\,\frac{\bar{X}_T - \mu}{\sigma} \xrightarrow{d} N(0, 1),$$

where $\bar{X}_T = T^{-1}\sum_{t=1}^{T} X_t$ is the arithmetic average.
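The statement of the Central Limit Theorem can be illustrated by a small Monte Carlo experiment. The sketch below makes several assumptions that are not part of the text: the draws are uniform on $(0, 1)$, so that $\mu = 1/2$ and $\sigma^2 = 1/12$, the sample size is $T = 200$, and 2000 replications are used. If the theorem applies, the standardized averages should have mean close to 0 and variance close to 1.

```python
import math
import random
import statistics

random.seed(0)                       # fixed seed for reproducibility
T, reps = 200, 2000                  # assumed sample size and replications
mu, sigma = 0.5, math.sqrt(1 / 12)   # mean and std. deviation of U(0, 1)

stats = []
for _ in range(reps):
    xbar = sum(random.random() for _ in range(T)) / T
    stats.append(math.sqrt(T) * (xbar - mu) / sigma)  # standardized average

m = statistics.fmean(stats)
v = statistics.pvariance(stats)
share_within_196 = sum(abs(s) <= 1.96 for s in stats) / reps
print(m, v, share_within_196)
```

The share of standardized averages within $\pm 1.96$ should be close to 0.95, the corresponding probability under the standard normal distribution.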

It  is  possible  to  relax  the  assumption  of  identically  distributed  variables  in  various 
ways  so  that  there  exists  a  variety  of  CLT’s  in  the  literature.  For  our  purpose  it  is 
especially  important  to  relax  the  independence  assumption.  A  natural  way  to  do  this 
is  by  the  notion  of  m-dependence. 

Definition C.7 (m-Dependence). A strictly stationary random process $\{X_t\}$ is called m-dependent for some nonnegative integer $m$ if and only if the two sets of random variables $\{X_\tau, \tau \leq t\}$ and $\{X_\tau, \tau \geq t + m + 1\}$ are independent.

Note that for such processes $\gamma(j) = 0$ for $j > m$. This type of dependence allows us to prove the following generalized Central Limit Theorem (see Brockwell and Davis 1991).

Theorem C.13 (CLT for m-Dependent Processes). Let $\{X_t\}$ be a strictly stationary mean zero m-dependent process with autocovariance function $\gamma(h)$ such that $V_m = \sum_{h=-m}^{m} \gamma(h) \neq 0$ then

(i) $\lim_{T\to\infty} T\,V(\bar{X}_T) = V_m$ and

(ii) $\sqrt{T}\,\bar{X}_T$ is asymptotically normal $N(0, V_m)$.
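Part (i) can be made concrete for an MA(1) process $X_t = Z_t + \theta Z_{t-1}$ with white noise variance $\sigma^2$, which is 1-dependent when the $Z_t$ are independent. The sketch below assumes $\theta = 0.5$ and $\sigma^2 = 1$ (illustrative values) and uses the standard identity $T\,V(\bar{X}_T) = \sum_{|h| < T} (1 - |h|/T)\gamma(h)$ for stationary processes, which is not stated in the text above.

```python
theta, sigma2 = 0.5, 1.0   # assumed MA(1) parameter and innovation variance

def gamma(h):
    """Autocovariance function of the MA(1) process X_t = Z_t + theta*Z_{t-1}."""
    h = abs(h)
    if h == 0:
        return (1 + theta ** 2) * sigma2
    if h == 1:
        return theta * sigma2
    return 0.0

m = 1
V_m = sum(gamma(h) for h in range(-m, m + 1))   # long-run variance

def T_var_mean(T):
    """T * V(sample mean) = sum over |h| < T of (1 - |h|/T) * gamma(h)."""
    return sum((1 - abs(h) / T) * gamma(h) for h in range(-(T - 1), T))

for T in (10, 100, 1000):
    print(T, T_var_mean(T), V_m)
```

Here $V_m = (1 + \theta)^2\sigma^2 = 2.25$, and $T\,V(\bar{X}_T) = 1.25 + (1 - 1/T)$ approaches this value as $T$ grows.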

Often it is difficult to derive the asymptotic distribution of $\{X_t\}$ directly. This situation can be handled by approximating the original process $\{X_t\}$ by a process $\{X_t^{(m)}\}$ which is easier to handle in terms of its asymptotic distribution and where the precision of the approximation can be "tuned" by the parameter $m$.

Theorem C.14 (Basis Approximation Theorem). Let $\{X_t\}$ and $\{X_t^{(m)}\}$ be two random vector processes such that

(i) $X_t^{(m)} \xrightarrow{d} X^{(m)}$ as $t \to \infty$ for each $m = 1, 2, \ldots$,

(ii) $X^{(m)} \xrightarrow{d} X$ as $m \to \infty$, and

(iii) $\lim_{m\to\infty} \limsup_{t\to\infty} P\left[|X_t - X_t^{(m)}| > \varepsilon\right] = 0$ for every $\varepsilon > 0$.

Then

$$X_t \xrightarrow{d} X \qquad \text{as } t \to \infty.$$


Beveridge-Nelson  Decomposition 


The Beveridge-Nelson decomposition proves to be an indispensable tool. Based on the seminal paper by Phillips and Solo (1992), we prove the following Theorem for matrix polynomials where $\|\cdot\|$ denotes the matrix norm (see Definition 10.6 in Chapter 10). The univariate version is then a special case with the absolute value replacing the norm.

Theorem D.1. Any lag polynomial $\Psi(L) = \sum_{j=0}^{\infty} \Psi_j L^j$ where $\Psi_j$ are $n \times n$ matrices with $\Psi_0 = I_n$ can be represented by

$$\Psi(L) = \Psi(1) - (I_n - L)\widetilde{\Psi}(L) \qquad \text{(D.1)}$$

where $\widetilde{\Psi}(L) = \sum_{j=0}^{\infty} \widetilde{\Psi}_j L^j$ with $\widetilde{\Psi}_j = \sum_{i=j+1}^{\infty} \Psi_i$. Moreover,

$$\sum_{j=1}^{\infty} j^2\|\Psi_j\|^2 < \infty \quad \text{implies} \quad \sum_{j=0}^{\infty} \|\widetilde{\Psi}_j\|^2 < \infty \text{ and } \|\Psi(1)\| < \infty.$$

Proof. The first part of the Theorem is obtained by the algebraic manipulations below:

$$\begin{aligned}
\Psi(L) - \Psi(1) &= I_n + \Psi_1 L + \Psi_2 L^2 + \ldots - I_n - \Psi_1 - \Psi_2 - \ldots \\
&= \Psi_1(L - I_n) + \Psi_2(L^2 - I_n) + \Psi_3(L^3 - I_n) + \ldots \\
&= (L - I_n)\Psi_1 + (L - I_n)\Psi_2(L + I_n) + (L - I_n)\Psi_3(L^2 + L + I_n) + \ldots \\
&= -(I_n - L)\Big(\underbrace{(\Psi_1 + \Psi_2 + \Psi_3 + \ldots)}_{\widetilde{\Psi}_0} + \underbrace{(\Psi_2 + \Psi_3 + \ldots)}_{\widetilde{\Psi}_1} L + \underbrace{(\Psi_3 + \ldots)}_{\widetilde{\Psi}_2} L^2 + \ldots\Big)
\end{aligned}$$

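Taking any $\delta \in (1/2, 1)$, the second part of the Theorem follows from

$$\begin{aligned}
\sum_{j=0}^{\infty} \|\widetilde{\Psi}_j\|^2 &= \sum_{j=0}^{\infty} \Big\|\sum_{i=j+1}^{\infty} \Psi_i\Big\|^2 \leq \sum_{j=0}^{\infty} \Big(\sum_{i=j+1}^{\infty} \|\Psi_i\|\Big)^2 \\
&= \sum_{j=0}^{\infty} \Big(\sum_{i=j+1}^{\infty} i^{\delta}\|\Psi_i\|\, i^{-\delta}\Big)^2 \\
&\leq \sum_{j=0}^{\infty} \Big(\sum_{i=j+1}^{\infty} i^{2\delta}\|\Psi_i\|^2\Big)\Big(\sum_{i=j+1}^{\infty} i^{-2\delta}\Big) \\
&\leq (2\delta - 1)^{-1} \sum_{j=0}^{\infty} \Big(\sum_{i=j+1}^{\infty} i^{2\delta}\|\Psi_i\|^2\Big) j^{1-2\delta} \\
&= (2\delta - 1)^{-1} \sum_{i=0}^{\infty} \Big(\sum_{j=0}^{i-1} j^{1-2\delta}\Big) i^{2\delta}\|\Psi_i\|^2 \\
&\leq [(2\delta - 1)(2 - 2\delta)]^{-1} \sum_{i=0}^{\infty} i^{2\delta}\|\Psi_i\|^2\, i^{2-2\delta} \\
&= [(2\delta - 1)(2 - 2\delta)]^{-1} \sum_{i=0}^{\infty} i^2\|\Psi_i\|^2 < \infty.
\end{aligned}$$

The first inequality follows from the triangle inequality for the norm. The second inequality is Hölder's inequality (see, for example, Naylor and Sell 1982, p. 548) with $p = q = 2$. The third and the fourth inequality follow from the Lemma below. The last inequality, finally, follows from the assumption.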


The last assertion follows from

$$\|\Psi(1)\| \leq \sum_{i=0}^{\infty} \|\Psi_i\| = \|I_n\| + \sum_{i=1}^{\infty} i\|\Psi_i\|\, i^{-1} \leq \|I_n\| + \Big(\sum_{i=1}^{\infty} i^2\|\Psi_i\|^2\Big)^{1/2}\Big(\sum_{i=1}^{\infty} i^{-2}\Big)^{1/2} < \infty.$$

The last inequality is again a consequence of Hölder's inequality. The summability assumption then guarantees the convergence of the first term in the product. Cauchy's condensation test finally establishes the convergence of the last term. □


Lemma D.1. The following results are useful:

(i) For any $b > 0$, $\sum_{i=j+1}^{\infty} i^{-1-b} \leq b^{-1} j^{-b}$.

(ii) For any $c \in (0, 1)$, $\sum_{j=1}^{i} j^{c-1} \leq c^{-1} i^{c}$.

Proof. Let $k$ be a number greater than $j$; then $k^{-1-b} \leq j^{-1-b}$ and

$$k^{-1-b} \leq \int_{k-1}^{k} j^{-1-b}\,dj = b^{-1}(k - 1)^{-b} - b^{-1}k^{-b}.$$

This implies that $\sum_{k=j+1}^{\infty} k^{-1-b} \leq b^{-1} j^{-b}$. This proves part (i) after relabeling the summation index $k$ as $i$. Similarly, $k^{c-1} \leq j^{c-1}$ and

$$k^{c-1} \leq \int_{k-1}^{k} j^{c-1}\,dj = c^{-1}k^{c} - c^{-1}(k - 1)^{c}.$$

Therefore $\sum_{k=1}^{i} k^{c-1} \leq c^{-1} i^{c}$, which proves part (ii) after relabeling the summation index $k$ as $j$. □

Remark D.1. An alternative common assumption is $\sum_{j=1}^{\infty} j\|\Psi_j\| < \infty$. It is, however, easy to see that this assumption is more restrictive as it implies the one assumed in the Theorem, but not vice versa. See Phillips and Solo (1992) for more details.
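The algebra behind representation (D.1) can be checked numerically in the univariate case. The sketch below assumes a finite lag polynomial with coefficients $\psi = (1, 0.5, 0.25, 0.125)$, chosen only for illustration, computes $\widetilde{\psi}_j = \sum_{i > j} \psi_i$, and compares the coefficients of $\Psi(1) - (1 - L)\widetilde{\Psi}(L)$ with those of $\Psi(L)$.

```python
psi = [1.0, 0.5, 0.25, 0.125]   # psi_0 = 1; finite polynomial for illustration
q = len(psi) - 1

# tilde psi_j = sum of psi_i over i > j
psi_tilde = [sum(psi[j + 1:]) for j in range(q + 1)]
psi_at_1 = sum(psi)             # Psi(1)

# Coefficients of Psi(1) - (1 - L) * tilde Psi(L), ordered by power of L
rhs = [0.0] * (q + 2)
rhs[0] += psi_at_1
for j, c in enumerate(psi_tilde):
    rhs[j] -= c       # contribution of -tilde psi_j * L^j
    rhs[j + 1] += c   # contribution of +tilde psi_j * L^(j+1)

print(psi_tilde, psi_at_1, rhs)
```

The list `rhs` reproduces the original coefficients followed by a zero for the extra power of $L$, which confirms the decomposition for this example.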


E  The Delta Method


It is often the case that it is possible to obtain an estimate $\hat\beta_T$ of some parameter $\beta$, but that one is really interested in a function $f$ of $\beta$. The Continuous Mapping Theorem then suggests to estimate $f(\beta)$ by $f(\hat\beta_T)$. But then the question arises how the distribution of $\hat\beta_T$ is related to the distribution of $f(\hat\beta_T)$.

Expanding the function into a first order Taylor approximation allows us to derive the following theorem.

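Theorem E.1. Let $\{\hat\beta_T\}$ be a $K$-dimensional sequence of random variables with the property $\sqrt{T}(\hat\beta_T - \beta) \xrightarrow{d} N(0, \Sigma)$ then

$$\sqrt{T}\left(f(\hat\beta_T) - f(\beta)\right) \xrightarrow{d} N\left(0, \nabla f(\beta)\,\Sigma\,\nabla f(\beta)'\right),$$

where $f : \mathbb{R}^K \to \mathbb{R}^J$ is a continuously differentiable function with Jacobian matrix (matrix of first order partial derivatives) $\nabla f(\beta) = \partial f(\beta)/\partial\beta'$.

Proof. See Serfling (1980, pp. 122-124). □

Remark E.1. In the one-dimensional case where $\sqrt{T}(\hat\beta_T - \beta) \xrightarrow{d} N(0, \sigma^2)$ and $f : \mathbb{R} \to \mathbb{R}$ the above theorem becomes:

$$\sqrt{T}\left(f(\hat\beta_T) - f(\beta)\right) \xrightarrow{d} N\left(0, [f'(\beta)]^2\sigma^2\right)$$

where $f'(\beta)$ is the first derivative evaluated at $\beta$.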
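Remark E.2. The $J \times K$ Jacobian matrix of first order partial derivatives is defined as

$$\nabla f(\beta) = \partial f(\beta)/\partial\beta' = \begin{pmatrix} \dfrac{\partial f_1(\beta)}{\partial\beta_1} & \cdots & \dfrac{\partial f_1(\beta)}{\partial\beta_K} \\ \vdots & \ddots & \vdots \\ \dfrac{\partial f_J(\beta)}{\partial\beta_1} & \cdots & \dfrac{\partial f_J(\beta)}{\partial\beta_K} \end{pmatrix}.$$

Remark E.3. In most applications $\beta$ is not known so that one evaluates the Jacobian matrix at $\hat\beta_T$.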


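Example: Univariate

Suppose we have obtained an estimate of $\beta$ equal to $\hat\beta = 0.6$ together with an estimate for its variance $\hat\sigma^2_{\hat\beta} = 0.2$. We can then approximate the variance of $f(\hat\beta) = 1/\hat\beta = 1.667$ by

$$\hat{V}(f(\hat\beta)) = \left[\frac{-1}{\hat\beta^2}\right]^2 \hat\sigma^2_{\hat\beta} = 1.543.$$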
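The arithmetic of this example is easy to reproduce. The following minimal sketch uses the numbers given in the example; the function names `f` and `f_prime` are illustrative only.

```python
beta_hat = 0.6   # estimate of beta from the example
var_beta = 0.2   # estimated variance of beta_hat from the example

def f(b):
    """Function of interest, f(beta) = 1/beta."""
    return 1.0 / b

def f_prime(b):
    """First derivative f'(beta) = -1/beta^2."""
    return -1.0 / b ** 2

# Delta method: V(f(beta_hat)) is approximated by f'(beta_hat)^2 * V(beta_hat)
var_f = f_prime(beta_hat) ** 2 * var_beta
print(round(f(beta_hat), 3), round(var_f, 3))
```

Rounded to three decimals, the point estimate is 1.667 and the approximate variance is 1.543, as in the example.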


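Example: Multivariate

In the process of computing the impulse response function of a VAR(1) model with $\Phi = \begin{pmatrix} \phi_{11} & \phi_{12} \\ \phi_{21} & \phi_{22} \end{pmatrix}$ one has to calculate $\Psi_2 = \Phi^2$. If we stack all coefficients of $\Phi$ into a vector $\beta = \operatorname{vec}(\Phi) = (\phi_{11}, \phi_{21}, \phi_{12}, \phi_{22})'$ then we get:

$$f(\beta) = \operatorname{vec}\Psi_2 = \operatorname{vec}\Phi^2 = \begin{pmatrix} \psi_{11}^{(2)} \\ \psi_{21}^{(2)} \\ \psi_{12}^{(2)} \\ \psi_{22}^{(2)} \end{pmatrix} = \begin{pmatrix} \phi_{11}^2 + \phi_{12}\phi_{21} \\ \phi_{11}\phi_{21} + \phi_{21}\phi_{22} \\ \phi_{11}\phi_{12} + \phi_{12}\phi_{22} \\ \phi_{12}\phi_{21} + \phi_{22}^2 \end{pmatrix}$$

where $\Psi_2 = \left[\psi_{ij}^{(2)}\right]$. The Jacobian matrix then becomes:

$$\nabla f(\beta) = \begin{pmatrix} \dfrac{\partial\psi_{11}^{(2)}}{\partial\phi_{11}} & \dfrac{\partial\psi_{11}^{(2)}}{\partial\phi_{21}} & \dfrac{\partial\psi_{11}^{(2)}}{\partial\phi_{12}} & \dfrac{\partial\psi_{11}^{(2)}}{\partial\phi_{22}} \\ \dfrac{\partial\psi_{21}^{(2)}}{\partial\phi_{11}} & \dfrac{\partial\psi_{21}^{(2)}}{\partial\phi_{21}} & \dfrac{\partial\psi_{21}^{(2)}}{\partial\phi_{12}} & \dfrac{\partial\psi_{21}^{(2)}}{\partial\phi_{22}} \\ \dfrac{\partial\psi_{12}^{(2)}}{\partial\phi_{11}} & \dfrac{\partial\psi_{12}^{(2)}}{\partial\phi_{21}} & \dfrac{\partial\psi_{12}^{(2)}}{\partial\phi_{12}} & \dfrac{\partial\psi_{12}^{(2)}}{\partial\phi_{22}} \\ \dfrac{\partial\psi_{22}^{(2)}}{\partial\phi_{11}} & \dfrac{\partial\psi_{22}^{(2)}}{\partial\phi_{21}} & \dfrac{\partial\psi_{22}^{(2)}}{\partial\phi_{12}} & \dfrac{\partial\psi_{22}^{(2)}}{\partial\phi_{22}} \end{pmatrix} = \begin{pmatrix} 2\phi_{11} & \phi_{12} & \phi_{21} & 0 \\ \phi_{21} & \phi_{11} + \phi_{22} & 0 & \phi_{21} \\ \phi_{12} & 0 & \phi_{11} + \phi_{22} & \phi_{12} \\ 0 & \phi_{12} & \phi_{21} & 2\phi_{22} \end{pmatrix}$$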
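In Section 15.4.4 we obtained the following estimate for a VAR(1) model for $\{X_t\} = \{(\ln(A_t), \ln(S_t))'\}$:

$$X_t = \hat{c} + \hat\Phi X_{t-1} + \hat{Z}_t = \begin{pmatrix} -0.141 \\ 0.499 \end{pmatrix} + \begin{pmatrix} 0.316 & 0.640 \\ -0.202 & 1.117 \end{pmatrix} X_{t-1} + \hat{Z}_t$$

The estimated covariance matrix of $\operatorname{vec}\hat\Phi$, $\hat{V}(\operatorname{vec}\hat\Phi)$, was:

$$\hat{V}(\hat\beta) = \hat{V}(\operatorname{vec}\hat\Phi) = \begin{pmatrix} 0.0206 & 0.0069 & -0.0201 & -0.0067 \\ 0.0069 & 0.0068 & -0.0067 & -0.0066 \\ -0.0201 & -0.0067 & 0.0257 & 0.0086 \\ -0.0067 & -0.0066 & 0.0086 & 0.0085 \end{pmatrix}$$

We can then approximate the variance of $f(\hat\beta) = \operatorname{vec}(\hat\Phi^2)$ by

$$\hat{V}(f(\operatorname{vec}\hat\Phi)) = \hat{V}(\operatorname{vec}\hat\Phi^2) = \nabla f(\operatorname{vec}\Phi)\big|_{\Phi=\hat\Phi}\; \hat{V}(\operatorname{vec}\hat\Phi)\; \nabla f(\operatorname{vec}\Phi)\big|'_{\Phi=\hat\Phi}.$$

This leads to:

$$\hat{V}(f(\operatorname{vec}\hat\Phi)) = \begin{pmatrix} 0.0245 & 0.0121 & -0.0245 & -0.0119 \\ 0.0121 & 0.0145 & -0.0122 & -0.0144 \\ -0.0245 & -0.0122 & 0.0382 & 0.0181 \\ -0.0119 & -0.0144 & 0.0181 & 0.0213 \end{pmatrix}$$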
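The multivariate computation can be reproduced with a few lines of plain Python. The sketch below takes the estimated coefficient matrix and covariance matrix from the example, builds the Jacobian of $f(\beta) = \operatorname{vec}(\Phi^2)$ for the ordering $\beta = (\phi_{11}, \phi_{21}, \phi_{12}, \phi_{22})'$, and forms the sandwich product. The helper functions `matmul` and `transpose` are written here only for illustration. Because the inputs are rounded to four decimals, the result matches the reported matrix only up to rounding.

```python
# Estimated VAR(1) coefficients from the example
phi11, phi12, phi21, phi22 = 0.316, 0.640, -0.202, 1.117

# Jacobian of f(beta) = vec(Phi^2) with beta = (phi11, phi21, phi12, phi22)'
J = [
    [2 * phi11, phi12, phi21, 0.0],
    [phi21, phi11 + phi22, 0.0, phi21],
    [phi12, 0.0, phi11 + phi22, phi12],
    [0.0, phi12, phi21, 2 * phi22],
]

# Estimated covariance matrix of vec(Phi_hat) from the example
V = [
    [0.0206, 0.0069, -0.0201, -0.0067],
    [0.0069, 0.0068, -0.0067, -0.0066],
    [-0.0201, -0.0067, 0.0257, 0.0086],
    [-0.0067, -0.0066, 0.0086, 0.0085],
]

def matmul(A, B):
    """Plain-Python matrix product of two lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def transpose(A):
    return [list(col) for col in zip(*A)]

# Delta method: V(f) is approximated by J V J'
V_f = matmul(matmul(J, V), transpose(J))
for row in V_f:
    print([round(x, 4) for x in row])
```

The diagonal entries, for instance, come out at about 0.0245 and 0.0213 for the first and last element of $\operatorname{vec}(\hat\Phi^2)$.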
autocorrelation function, 29
stationary solution, 29
ARMA model
estimation, 87
ARMA process, see also Autoregressive
moving-average  process 
autocovariance  function,  38 
causality,  32 
causality  condition,  33 
estimation,  95 
invertibility,  37 
invertibility  condition,  37 
maximum  likelihood  estimation,  95 
state  space  representation,  330 
Autocorrelation  function,  14 
confidence  interval 
MA(q)  process,  76 
AR(1)  process,  77 
estimation,  73 

asymptotic  distribution,  74 
Bartlett’s  formula,  74 
confidence  interval,  75 
interpretation, 64
order,  14 
properties,  21 
random  walk,  144 
univariate,  14 

Autocorrelation  function,  partial,  62 
AR  process,  63 


estimation,  78 
interpretation,  64 
MA  process,  52,  64 
Autocovariance  function,  13 
ARMA  process,  38 
estimation,  73 
linear  process,  124 
MA(1)  process,  21 
multivariate,  202 
order,  13 
properties,  20 
random  walk,  144 
univariate,  13 

Autoregressive  conditional  heteroskedasticity 
models,  see  Volatility 
Autoregressive  final  form,  223 
Autoregressive  moving-average  process,  25 
Autoregressive moving-average process
mean,  25 
Back-shift operator, see also Lag operator
Bandwidth,  80 
Bartlett’s  formula,  74 
Basic  structural  model,  332,  349 
cyclical component, 333
local  linear  trend  model,  333 
seasonal  component,  333 
Bayesian  VAR,  253 

Beveridge-Nelson  decomposition,  138,  383 
Bias  proportion,  250 
Bias,  small  sample,  92,  231 
correction,  92,  231 

BIC, see Information criterion, 101, 247
Borel-Cantelli  lemma,  377 
Canonical correlation coefficients, 315
Cauchy-Bunyakovskii-Schwarz 
inequality,  377 
Causal  representation,  32 
Causality,  see  also  Wiener-Granger 
causality,  328 

Wiener-Granger  causality,  255 
Central  Limit  Theorem 
m-dependence,  381 
Characteristic  function,  380 
Chebyschev’s  inequality,  377 
Chow  test,  355 
Cointegration,  159 

Beveridge-Nelson  decomposition,  304, 
309 

bivariate,  159 

common  trend  representation,  310 
definition,  305 

fully-modified  OLS,  319,  323 

Granger’s  representation  theorem,  309 

normalization,  323 

order  of  integration,  303 

shocks,  permanent  and  transitory,  311 

Smith-McMillan  factorization,  306 

test 

Johansen  test,  312 
regression  test,  161 
triangular representation, 311
VAR  model,  305 
assumptions,  305 
VECM,  307 

vector  error  correction,  307 
Wald  test,  321 
Companion  form,  218 
Convergence 

Almost  sure  convergence,  378 
Convergence  in  r-th  mean,  378 
Convergence  in  distribution,  379 
Convergence  in  probability,  378 
Correlation  function,  202 
estimator,  208 
multivariate,  202 
Covariance  function 
estimator,  208 
properties,  203 
covariance  function,  202 
Covariance  proportion,  250 
Covariance,  long-run,  209 
Cross-correlation,  203 

distribution,  asymptotic,  209 
Cyclical  component,  128,  333 


D 

Dickey-Fuller  distribution,  142 
Durbin-Levinson  algorithm,  48,  63 
Dynamic  factor  model,  335 
Dynamic  multiplier,  see  Shocks,  transitory 
EM algorithm, 345
Ergodicity,  10,  69 
Estimation 

ARMA  model,  95 
order,  99 
Estimator 

maximum  likelihood  estimator,  95,  96 
method  of  moments 

GARCH(1,1)  model,  187
moment  estimator,  88 
OLS  estimator,  91 

process,  integrated,  141 
Yule-Walker  estimator,  88
Example 

AD-curve  and  Money  Supply,  260 
advertisement  and  sales,  274 
ARMA  processes,  34 
cointegration 

fully-modified  OLS,  323 
Johansen  approach,  321 
consumption  expenditure  and 
advertisement,  212 
demand  and  supply  shocks,  287 
estimation  of  long-run  variance,  83 
estimation  of  quarterly  GDP,  346 
GDP  and  consumer  sentiment  index,  213 
growth  model,  neoclassical,  323 
inflation  and  short-term  interest  rate,  162 
IS-LM  model  with  Phillips  curve,  277 
modeling  real  GDP  of  Switzerland,  103 
present  discounted  value  model,  296 
structural  breaks,  356 
Swiss  Market  Index,  188 
term  structure  of  interest  rate,  164 
unit  root  test,  152
Expectation,  adaptive,  59 
Exponential  smoothing,  58 
Factor  model,  dynamic,  see  Dynamic  factor
model 

FEVD,  see  also  Forecast  error  variance 
decomposition 




Filter 

gain  function,  126
Gibbs  phenomenon,  128 
high-pass,  128 
Hodrick-Prescott  filter,  128 
HP-filter,  128 
Kuznets  filter,  126 
low-pass,  127 
phase  function,  126 
TRAMO-SEATS,  131 
transfer  function,  125
X-11  filter,  131
X-12-Filter,  131
Filter,  time  invariant,  122 
Filtering  problem,  336 
Final  form,  see  Autoregressive  final  form
FMOLS  estimator,  319,  323
Wald  test,  321,  324

Forecast  error  variance  decomposition,  270 
Forecast  evaluation 
Bias  proportion,  250 
Covariance  proportion,  250 
Mean-absolute-error,  249 
Out-of-sample  strategy,  250 
Root-mean-squared-error,  249 
Uncertainty,  251 
Variance  proportion,  250 
Forecast  function,  45 
AR(p)  process,  48 
ARMA(1,1)  process,  53 
forecast  error,  47 
infinite  past,  53 
linear,  45 

MA(q)  process,  50 
variance  of  forecast  error,  48 
Forecast,  direct,  253 
Forecast,  iterated,  244,  250,  253 
Fourier  frequencies,  117 
Fourier  transform,  discrete,  118 
FPE,  see  Information  criterion,  248 
Frequency  domain,  109 
Fully-modified  ordinary  least-squares,  319 


G 

Gain  function,  126
Gauss  Markov  theorem,  92 
Growth  component,  128 

H 

HAC  variance,  see  also  variance, 

heteroskedastic  and  autocorrelation 
consistent 


Harmonic  process,  54,  115 
Hodrick-Prescott  filter,  128 
HQC,  see  Information  criterion,  101,  247

I 

Identification 

Box-Jenkins,  64 
Kalman  filter,  346 
Identification  problem,  262 
Impulse  response  function,  32,  37 
Information  criterion,  101,  247 
AIC,  101,  247
BIC,  101,  247
Final  prediction  error,  248 
FPE,  248 

Hannan-Quinn,  247 
HQC,  101 
Schwarz,  101,  247
Innovation  algorithm,  48 
Innovations,  56 
Integrated  GARCH,  181 
Integrated  process,  102 
Integrated  regressors 
rules  of  thumb,  162 
Integration,  order  of,  134 
Intercept  correction,  253 
Invertibility,  37 

J 

Johansen  test 

distribution,  asymptotic,  318 
hypothesis  tests  over  β,  318
max  test,  316 

specification  of  deterministic  part,  317 
trace  test,  316 

K 

Kalman  filter,  339 
application 

basic  structural  model,  349 
estimation  of  quarterly  GDP,  346 
AR(1)  process,  337,  342 
assumptions,  327 
causal,  328 
EM  algorithm,  345 
filtering  problem,  336 
forecasting  step,  339 
gain  matrix,  340 
identification,  346 
initialization,  340 
likelihood  function,  344 

Markov  property,  327 
measurement  errors,  342 
observation  equation,  326 
prediction  problem,  336 
smoother,  341 
smoothing  problem,  336 
stable,  328 
state  equation,  326 
stationarity,  328 
updating  step,  339 
Kalman  smoother,  341 
Kernel  function,  80 
bandwidth,  80 
optimal,  82 
rule  of  thumb,  82 
Bartlett,  80 
boxcar,  80 
Daniell,  80 

lag  truncation  parameter,  80 
optimal,  82 
quadratic  spectral,  80 
Tukey-Hanning,  80 

L 

Lag  operator,  26 

calculation  rules,  26 
definition,  26 
polynomial,  26 
Lag  polynomial,  26 
Lag  truncation  parameter,  80 
Lag  window,  117 
Lead  operator,  26 
Leading  indicator,  213,  259 
Least-squares  estimator,  91,  97 
Likelihood  function,  95,  344,  365 
ARMA  process,  95 
Kalman  filter,  344 
regime  switching  model,  365 
Ljung-Box  statistic,  75 
Loading  matrix 
definition,  306 

Local  linear  trend  model,  333 
Long-run  identification,  285 
instrumental  variables,  286 


M 

m-dependence,  381 
MA  process,  17,  27 

autocorrelation  function,  28 
autocovariance  function,  21,  27 
MAE,  249 


Markov  chain,  364 

ergodic  distribution,  365 
regular,  365 
Matrix  norm,  205 

absolute  summability,  206 
quadratic  summability,  206 
submultiplicativity,  206 
Max  share  identification,  see  also  VAR 
process,  see  also  VAR  process 
Maximum  likelihood  estimator,  96,  184 
ARMA(p,q)  model,  95 
asymptotic  distribution,  98 
AR  process,  98 
ARMA(1,1)  process,  99 
MA  process,  99 
GARCH(p,q)  model,  186 
Maximum  likelihood  method,  95 
Mean,  67,  207 

asymptotic  distribution,  69,  71
distribution,  asymptotic,  208 
estimation,  67,  207 
estimator,  208 
Mean  reverting,  133 
Mean  squared  error  matrix 
estimated  coefficients,  247 
known  coefficients,  244 
Measurement  errors,  337 
Median-target  method,  293 
Memory,  short,  28 
Minnesota  prior,  253,  361 
Missing  observations,  331 
Mixture  distributions,  365 
Model,  10 

N 

Normal  distribution,  multivariate 
conditional,  337 
Normal  equations,  46 

O 

Observation  equation,  326 
Observationally  equivalent,  262 
OLS  estimator,  91 

distribution,  asymptotic,  92 
Order  of  integration,  134 
Ordinary-least-squares  estimator,  91 
Oscillation  length,  111 
Overfitting,  99 

P 

PACF,  see  also  Autocorrelation  function, 
partial 
Partial  autocorrelation  function
computation,  63 
estimation,  78 
Particle  filter,  360 
Penalty  function  approach,  293 
Period  length,  111
Periodogram,  117 
Perpetuity,  357 
Persistence,  137 
Phase  function,  126
Portmanteau  test,  76 
PP-test,  149 

Prediction  problem,  336 
Predictor,  see  also  Forecast  function 
Present  discounted  value  model,  296 
Beveridge-Nelson  decomposition,  301 
cointegration,  299 
spread,  297 

VAR  representation,  298 
vector  error  correction  model,  298 
Prewhitening,  83 
Process,  ARIMA,  134 
Process,  stochastic,  7,  201 
ARMA  process,  25 
branching  process,  11
deterministic,  54 
difference-stationary,  134 
finite  memory,  18
finite-range  dependence,  18
Gaussian  process,  15 
harmonic  process,  54 
integrated,  102,  134,  303 

Beveridge-Nelson  decomposition,  138

forecast,  long-run,  135 
impulse  response  function,  137 
OLS  estimator,  141 
persistence,  137 
variance  of  forecast  error,  136 
linear,  204 
linearly  regular,  57 
memory,  15 

moving-average  process,  17 
multivariate,  201 
purely  non-deterministic,  57 
random  walk,  19 
random  walk  with  drift,  19 
singular,  54 

spectral  representation,  116 
trend-stationary,  134 
forecast,  long-run,  135 
impulse  response  function,  137 
variance  of  forecast  error,  136 
white  noise,  15 


R 

Random  walk,  11,  19 

autocorrelation  function,  144 
autocovariance  function,  144 
Random  walk  with  drift,  19 
Real  business  cycle  model,  336 
Realization,  9 

Regime  switching  model,  364 

maximum  likelihood  estimation,  365 
Restrictions 
long-run,  282 
short-run,  268 
sign  restrictions,  267 
RMSE,  249 


S 

Seasonal  component,  333 
Set  identified,  292 
Shocks 

fundamental,  57 
permanent,  37 
structural,  260 
transitory,  36 

Short  range  dependence,  see  also  Memory, 
short 

Signal-to-noise  ratio,  333 
Singular  values,  315 
Smoothing,  341 
Smoothing  problem,  336 
Smoothing,  exponential,  58 
Spectral  average  estimator,  discrete,  118
Spectral  decomposition,  109 
Spectral  density,  110,  115 
ARMA  process,  121 
autocovariance  function,  111
estimator,  direct,  117 
estimator,  indirect,  117 
Fourier  coefficients,  111
spectral  density,  rational,  122 
variance,  long-run,  117 
Spectral  distribution  function,  115
Spectral  representation,  114,  116 
Spectral  weighting  function,  119 
Spectral  window,  119 
Spectrum  estimation,  109 
Spurious  correlation,  158 
Spurious  regression,  158 
State  equation,  326 
State  space,  9 

State  space  representation,  218,  326 
ARMA  processes,  330 
ARMA(1,1),  329 
missing  observations,  331 
stationary,  328 

time-varying  coefficients,  331,  357 
Cooley-Prescott,  331 
Harvey-Phillips,  331 
Hildreth-Houck,  331 
VAR  process,  329 
Stationarity,  13 
multivariate,  202 
strict,  14 
weak,  13 
Stationarity,  strict 
multivariate,  203 

Strong  Law  of  Large  Numbers,  378 
Structural  breaks,  153,  252,  354 
Chow  test,  355 
dating  of  breaks,  356 
tests,  356 

Structural  change,  20 
Structural  time  series  analysis,  140 
Structural  time  series  model,  332,  349 
basic  structural  model,  332 
Summability 
absolute,  206 
quadratic,  206 
Summability  Condition,  383 
Superconsistency,  142 
Swiss  Market  Index  (SMI),  188 


T 

Test 

autocorrelation,  squared  residuals,  183 
cointegration 

regression  test,  161 
Dickey-Fuller  regression,  146 
Dickey-Fuller  test,  146,  147 
augmented,  148 
correction,  autoregressive,  148 
heteroskedasticity,  183 

Engle's  Lagrange-multiplier  test,  184 
Independence,  210 
Johansen  test,  312 

correlation  coefficients,  canonical,  315 

distribution,  asymptotic,  318 

eigenvalue  problem,  314 

hypotheses,  312 

hypothesis  tests  over  β,  318

likelihood  function,  315 

max  test,  316 

singular  values,  315 

trace  test,  316 

Kwiatkowski-Phillips-Schmidt-Shin-test,  157


Phillips-Perron  test,  146,  149 
stationarity,  157 
uncorrelatedness,  210 
unit  root  test 

structural  breaks,  153 
testing  strategy,  150 
unit-root  test,  146 
white  noise 

Box-Pierce  statistic,  75 
Ljung-Box  statistic,  75 
Portmanteau  test,  76 
Time,  8 

Time  domain,  109 
Time  series  model,  10 
Time-varying  coefficients,  331,  357 
Minnesota  prior,  361 
regime  switching  model,  364 
Time  series  analysis,  structural,  140
Trajectory,  9 
Transfer  function,  125 
Transfer  function  form,  223 
Transition  probability  matrix,  364 


U 

Underfitting,  99 


V 

Value-at-Risk,  192 
VaR,  see  Value-at-Risk 
VAR  process 

Bayesian  VAR,  253 
correlation  function,  221 
covariance  function,  221 
estimation 

order  of  VAR,  247 
Yule-Walker  estimator,  238
forecast  error  variance  decomposition,  270 
forecast  function,  241 

mean  squared  error,  243,  244 
form,  reduced,  261,  263 
form,  structural,  260,  263 
identification 

forecast  error  variance  share 
maximization,  272,  293 
long-run  identification,  282,  285 
short-run  identification,  268 
sign  restrictions,  267,  289 
zero  restrictions,  268 
identification  problem,  262,  264 
Cholesky  decomposition,  269 
impulse  response  function,  270 
bootstrap,  273 
confidence  intervals,  272
delta  method,  273 
state  space  representation,  329 
Structural  breaks,  252 
time-varying  coefficients,  357 
VAR(1)  process,  216
stationarity,  216 
variance  decomposition 
confidence  intervals,  272 
Variance  proportion,  250 
Variance,  heteroskedastic  and  autocorrelation 
consistent,  72 
Variance,  long-run,  72,  209 
estimation,  79,  83 
prewhitening,  83 
multivariate,  209 
spectral  density,  117 
VARMA  process,  215 

causal  representation,  219 
condition  for  causal  representation,  219 
VECM,  see  also  Cointegration 
Vector  autoregressive  moving-average  process, 
see  also  VARMA  process 
Vector  autoregressive  process,  see  also  VAR 
process 

Volatility 

ARCH(p)  model,  173 
ARCH-in-mean  model,  176 
EGARCH  model,  176
Forecasting,  182 
GARCH(1,1)  model,  177
GARCH(p,q)  model,  174 
ARMA  process,  175 


heavy-tail  property,  175 
GARCH(p,q)  model,  asymmetric,  176 
heavy-tail  property,  172 
IGARCH,  181 
models,  173 

TARCH(p,q)  model,  176 
time-varying,  360 

Wishart  autoregressive  process,  360 


W 

Weighting  function,  80 
White  noise,  15 
multivariate,  204 
univariate,  15 

Wiener-Granger  causality,  255 
test 

F-test,  257 

Haugh-Pierce  test,  260 
Wishart  autoregressive  process,  360 
Wold  Decomposition  Theorem,  55 
multivariate,  245 
univariate,  55 


Yule-Walker  equations 
multivariate,  221 
univariate,  88 
Yule-Walker  estimator,  88
AR(1)  process,  89 
asymptotic  distribution,  89 
MA  process,  90 


